<div style="border:solid blue 2px; padding: 20px">
  
**Hello James**

My name is Dima, and I will be reviewing your project. 

You will find my comments in coloured cells marked 'Reviewer's comment'. The cell colour depends on the type of comment, as explained below. 

**Note:** Please do not remove or change my comments - they will help me in my future reviews and will make the process smoother for both of us. 

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment</b> 
    
Comments like this mark efficient solutions and good ideas that can be used in other projects.
</div>

<div class="alert alert-warning"; style="border-left: 7px solid gold">
<b>⚠️ Reviewer's comment</b> 
    
Yellow comments mark places where there is room for optimisation. The correction is not required, but it is good practice to implement it.
</div>

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment</b> 
    
If you see such a comment, it means that there is a problem that needs to be fixed. Please note that I won't be able to accept your project until the issue is resolved.
</div>

You are also very welcome to leave your own comments, describe the corrections you've made, or ask me questions, marking them with a different colour. You can use the example below: 

<div class="alert alert-info"; style="border-left: 7px solid blue">
<b>Student's comment</b>

# Yandex.Music

<div style="border:solid green 2px; padding: 20px">
    
<div class="alert alert-success">
<b>Review summary</b> 
    
James, thanks for submitting the project. You've done a very good job and I enjoyed reviewing it.
    
- You completed all the tasks.
- Your code was optimal and easy to read. 
- You wrote your own functions.
    
There are only a few critical issues that need to be corrected. You will find them in the red cells in the relevant sections. If you have any questions, please include them when you return your project. 
    
I'll be looking forward to getting your updated notebook.

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing the hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: user activity in the two cities](#activity)
    * [3.2 Hypothesis 2: music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
Whenever we're doing research, we need to formulate hypotheses that we can then test. Sometimes we accept these hypotheses; other times, we reject them. To make the right decisions, a business must be able to understand whether or not it's making the right assumptions.

In this project, you'll compare the music preferences of the cities of Springfield and Shelbyville. You'll study real Yandex.Music data to test the hypotheses below and compare user behavior for these two cities.

### Goal: 
Test three hypotheses:
1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

### Stages 
Data on user behavior is stored in the file `/datasets/music_project_en.csv`. There is no information about the quality of the data, so you will need to explore it before testing the hypotheses. 

First, you'll evaluate the quality of the data and see whether its issues are significant. Then, during data preprocessing, you will try to account for the most critical problems.
 
Your project will consist of three stages:
 1. Data overview
 2. Data preprocessing
 3. Testing the hypotheses
 
[Back to Contents](#back)

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
The title and introduction are essential parts of a project. Make sure you include them in your future projects. 
    
Ideally, the introduction consists of:
    
- brief description of the situation;
- goal of the project;
- description of the data we are going to use.
</div>


## Stage 1. Data overview <a id='data_review'></a>

Open the data on Yandex.Music and explore it.

You'll need `pandas`, so import it.

In [1]:
import pandas as pd


<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> The required library has been imported. </div>

Read the file `music_project_en.csv` from the `/datasets/` folder and save it in the `df` variable:

In [2]:
df = pd.read_csv('/datasets/music_project_en.csv')

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> The correct path to the file is specified: the slash at the beginning of the path is very important, as it indicates that you need to search for the file in the root folder. </div>

Print the first 10 table rows:

In [3]:
first_10_rows = df.head(10) # obtaining the first 10 rows from the df table

In [4]:
print(df.head(10)) # printing the first 10 rows of the df table

     userID                        Track            artist   genre  \
0  FFB692EC            Kamigata To Boots  The Mass Missile    rock   
1  55204538  Delayed Because of Accident  Andreas Rönnberg    rock   
2    20EC38            Funiculì funiculà       Mario Lanza     pop   
3  A3DD03C9        Dragons in the Sunset        Fire + Ice    folk   
4  E2DC1FAE                  Soul People        Space Echo   dance   
5  842029A1                       Chains          Obladaet  rusrap   
6  4CB90AA5                         True      Roman Messer   dance   
7  F03E1C1F             Feeling This Way   Polina Griffith   dance   
8  8FA1D3BE                     L’estate       Julia Dalia  ruspop   
9  E772D5C0                    Pessimist               NaN   dance   

        City        time        Day  
0  Shelbyville  20:28:33  Wednesday  
1  Springfield  14:07:09     Friday  
2  Shelbyville  20:58:07  Wednesday  
3  Shelbyville  08:37:09     Monday  
4  Springfield  08:34:34     Monday  
5

Obtaining the general information about the table with one command:

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 
    
Please obtain the general information about the table here. You could use `info()` or `describe()`.
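
A minimal sketch of this step, using a small made-up frame in place of the real file (the `sample` name and its rows are assumptions for illustration). `info()` reports column names, non-null counts and dtypes in one call; a column with fewer non-null entries than the others contains missing values.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for music_project_en.csv
sample = pd.DataFrame({
    '  userID': ['FFB692EC', '55204538', 'E772D5C0'],
    'Track': ['Kamigata To Boots', 'Delayed Because of Accident', 'Pessimist'],
    'artist': ['The Mass Missile', 'Andreas Rönnberg', None],
})

# info() prints to stdout by default; buf= captures the text instead
buffer = io.StringIO()
sample.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# Non-null counts per column: a lower count means missing values
non_null = sample.count()
print(non_null)
```

In the notebook itself, a plain `df.info()` in the empty cell would be enough.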

The table contains seven columns. They all store the same data type: `object`.

According to the documentation:
- `'userID'` — user identifier
- `'Track'` — track title
- `'artist'` — artist's name
- `'genre'`
- `'City'` — user's city
- `'time'` — the exact time the track was played
- `'Day'` — day of the week

We can see three issues with style in the column names:
1. Some names are uppercase, some are lowercase.
2. There are spaces in some names.
3. `Detect the third issue yourself and describe it here`.

The number of column values is different. This means the data contains missing values.


<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 
    
Unfortunately, you didn't answer the question: the third issue still needs to be described.

Hint: remember snake_case.

### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data on a track that was played. Some columns describe the track itself: its title, artist and genre. The rest convey information about the user: the city they come from, the time they played the track. 

It's clear that the data is sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to preprocess the data.

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
Please note that it is highly recommended to add a conclusion or summary after each section and briefly describe your observations and the major outcomes of the analysis.

[Back to Contents](#back)

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>
Correct the formatting in the column headers and deal with the missing values. Then, check whether there are duplicates in the data.

### Header style <a id='header_style'></a>
Print the column header:

In [5]:
print(df.columns) # the list of column names in the df table

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


Change column names according to the rules of good style:
* If the name has several words, use snake_case
* All characters must be lowercase
* Delete spaces

In [6]:
df = df.rename(columns={
    '  userID': 'user_id',
    'Track': 'track',
    'artist': 'artist',
    'genre': 'genre',
    '  City  ': 'city',
    'time': 'time',
    'Day': 'day'
})
# renaming columns

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
This is a good way to rename the columns.
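
As an alternative sketch, most of the clean-up can be done programmatically, leaving only the multi-word name to fix by hand. The `raw` and `cleaned` names are assumptions for illustration; the headers are copied from the printed `Index` above.

```python
import pandas as pd

# Hypothetical empty frame with the same raw headers as the notebook's table
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# Strip spaces and lowercase every name
cleaned = raw.copy()
cleaned.columns = cleaned.columns.str.strip().str.lower()

# 'userid' still needs an underscore to follow snake_case
cleaned = cleaned.rename(columns={'userid': 'user_id'})

print(list(cleaned.columns))
```

An explicit dictionary, as in the cell above, stays easier to read when there are only a few columns.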

Check the result. Print the names of the columns once more:

In [7]:
print(df.columns) # checking result: the list of column names

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')


[Back to Contents](#back)

### Missing values <a id='missing_values'></a>
First, find the number of missing values in the table. To do so, use two `pandas` methods:

In [8]:
print(df.isna().sum()) # calculating missing values

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64


<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
Great, `isna()` is the right method for finding the missing values.

Not all missing values affect the research. For instance, the missing values in `track` and `artist` are not critical. You can simply replace them with clear markers.

But missing values in `'genre'` can affect the comparison of music preferences in Springfield and Shelbyville. In real life, it would be useful to learn the reasons why the data is missing and try to make up for them. But we do not have that opportunity in this project. So you will have to:
* Fill in these missing values with markers
* Evaluate how much the missing values may affect your computations

Replace the missing values in `'track'`, `'artist'`, and `'genre'` with the string `'unknown'`. To do this, create the `columns_to_replace` list, loop over it with `for`, and replace the missing values in each of the columns:

In [9]:
df['track'] = df['track'].fillna('unknown')
df['artist'] = df['artist'].fillna('unknown')
df['genre'] = df['genre'].fillna('unknown') 
# looping over column names and replacing missing values with 'unknown'

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 
    
Not quite right. According to the task, we should:
    
* ...create the `columns_to_replace` **list**, **loop over it with for**, and replace the missing values in each of the columns.
    
So we need to create a list and loop over it here.
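
A minimal sketch of the list-and-loop approach, run on a made-up sample (the `sample` frame and its rows are assumptions for illustration; in the notebook the loop would act on `df`):

```python
import pandas as pd

# Hypothetical sample with gaps in the three columns named in the task
sample = pd.DataFrame({
    'track': ['Pessimist', None, 'True'],
    'artist': [None, 'Obladaet', 'Roman Messer'],
    'genre': ['dance', 'rusrap', None],
    'city': ['Shelbyville', 'Shelbyville', 'Springfield'],
})

columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    # Replace missing values in this column with the marker string
    sample[column] = sample[column].fillna('unknown')

print(sample.isna().sum())
```

The result is the same as three separate `fillna()` calls, but adding another column later only requires extending the list.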

Make sure the table contains no more missing values. Count the missing values again.

In [10]:
print(df.isna().sum()) # counting missing values

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64


[Back to Contents](#back)

### Duplicates <a id='duplicates'></a>
Find the number of obvious duplicates in the table using one command:

In [11]:
print(df.duplicated().sum()) # counting clear duplicates

3826


Call the `pandas` method for getting rid of obvious duplicates:

In [12]:
df = df.drop_duplicates()
# removing obvious duplicates

Count obvious duplicates once more to make sure you have removed all of them:

In [13]:
print(df.duplicated)# checking for duplicates

<bound method DataFrame.duplicated of         user_id                              track            artist  \
0      FFB692EC                  Kamigata To Boots  The Mass Missile   
1      55204538        Delayed Because of Accident  Andreas Rönnberg   
2        20EC38                  Funiculì funiculà       Mario Lanza   
3      A3DD03C9              Dragons in the Sunset        Fire + Ice   
4      E2DC1FAE                        Soul People        Space Echo   
...         ...                                ...               ...   
65074  729CBB09                            My Name            McLean   
65075  D08D4A55  Maybe One Day (feat. Black Spade)       Blu & Exile   
65076  C5E3A0D5                          Jalopiina           unknown   
65077  321D0506                      Freight Train     Chas McDevitt   
65078  3A64EF84          Tell Me Sweet Little Lies      Monica Lopez   

            genre         city      time        day  
0            rock  Shelbyville  20:28:33  W

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
Great, you found and removed the duplicates. The final check needs parentheses, though: `df.duplicated().sum()` returns the number of remaining duplicates, whereas `df.duplicated` prints the method itself.
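
As a sketch of the remove-and-verify sequence on a made-up sample (the `sample` frame is an assumption for illustration), the duplicate count should drop to zero after `drop_duplicates()`:

```python
import pandas as pd

# Hypothetical sample with one fully repeated row
sample = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2'],
    'track': ['True', 'True', 'Chains'],
})

before = sample.duplicated().sum()  # number of fully repeated rows

# reset_index(drop=True) is optional; it renumbers the rows after the drop
sample = sample.drop_duplicates().reset_index(drop=True)

after = sample.duplicated().sum()
print(before, after)
```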

Now get rid of implicit duplicates in the `genre` column. For example, the name of a genre can be written in different ways. Such errors will also affect the result.

Print a list of unique genre names, sorted in alphabetical order. To do so:
* Retrieve the intended DataFrame column 
* Apply a sorting method to it
* For the sorted column, call the method that will return all unique column values

In [14]:
print(df['genre'].unique()) # viewing unique genre names

['rock' 'pop' 'folk' 'dance' 'rusrap' 'ruspop' 'world' 'electronic'
 'unknown' 'alternative' 'children' 'rnb' 'hip' 'jazz' 'postrock' 'latin'
 'classical' 'metal' 'reggae' 'triphop' 'blues' 'instrumental' 'rusrock'
 'dnb' 'türk' 'post' 'country' 'psychedelic' 'conjazz' 'indie'
 'posthardcore' 'local' 'avantgarde' 'punk' 'videogame' 'techno' 'house'
 'christmas' 'melodic' 'caucasian' 'reggaeton' 'soundtrack' 'singer' 'ska'
 'salsa' 'ambient' 'film' 'western' 'rap' 'beats' "hard'n'heavy"
 'progmetal' 'minimal' 'tropical' 'contemporary' 'new' 'soul' 'holiday'
 'german' 'jpop' 'spiritual' 'urban' 'gospel' 'nujazz' 'folkmetal'
 'trance' 'miscellaneous' 'anime' 'hardcore' 'progressive' 'korean'
 'numetal' 'vocal' 'estrada' 'tango' 'loungeelectronic' 'classicmetal'
 'dubstep' 'club' 'deep' 'southern' 'black' 'folkrock' 'fitness' 'french'
 'disco' 'religious' 'hiphop' 'drum' 'extrememetal' 'türkçe'
 'experimental' 'easy' 'metalcore' 'modern' 'argentinetango' 'old' 'swing'
 'breaks' 'eurofolk' 

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 

Please note that the task asks the following:
    
* For the **sorted** column, call the method that will return all unique column values
    
So the column needs to be sorted before `unique()` is called.
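
A minimal sketch of the three steps on a made-up column (the `genres` Series is an assumption for illustration). Because `unique()` keeps values in order of first appearance, sorting first yields an alphabetical list in which near-identical names sit next to each other.

```python
import pandas as pd

# Hypothetical genre column containing implicit duplicates
genres = pd.Series(['rock', 'pop', 'hip', 'hiphop', 'pop', 'hop', 'dance'], name='genre')

# Sort the column first, then ask for the unique values
sorted_unique = genres.sort_values().unique()
print(sorted_unique)
```

In the notebook this would read `df['genre'].sort_values().unique()`.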

Look through the list to find implicit duplicates of the genre `hiphop`. These could be names written incorrectly or alternative names of the same genre.

You will see the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To get rid of them, declare the function `replace_wrong_genres()` with two parameters: 
* `wrong_genres=` — the list of duplicates
* `correct_genre=` — the string with the correct value

The function should correct the names in the `'genre'` column from the `df` table, i.e. replace each value from the `wrong_genres` list with the value in `correct_genre`.

In [15]:
wrong_genres= ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'
# function for replacing implicit duplicates

Call `replace_wrong_genres()` and pass it arguments so that it clears implicit duplicates (`hip`, `hop`, and `hip-hop`) and replaces them with `hiphop`:

In [16]:
def replace_wrong_values(wrong_values, correct_value):
    for wrong_value in wrong_values: # looping over misspelled names
        replace_wrong_genres(df, wrong_genres, correct_genre) # removing implicit duplicates

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
Yes, this is what was needed!
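
For reference, a minimal sketch of a replacement function of the kind the task describes. To keep the example self-contained it takes the table as an explicit first argument (`data`, an assumption that departs from the two-parameter signature in the task) and runs on a made-up `sample` frame.

```python
import pandas as pd

def replace_wrong_genres(data, wrong_genres, correct_genre):
    """Replace every value from wrong_genres in data['genre'] with correct_genre."""
    for wrong_genre in wrong_genres:
        # replace() swaps whole cell values that match wrong_genre exactly
        data['genre'] = data['genre'].replace(wrong_genre, correct_genre)

# Hypothetical sample standing in for the notebook's df
sample = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})
replace_wrong_genres(sample, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(sample['genre'].unique())
```

To match the task exactly, drop the first parameter and let the function work on the notebook's `df` directly. Either way, the function has to be called before the next check; if `hip` still appears in the list of unique values, the replacement has not been applied.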

Make sure the duplicate names were removed. Print the list of unique values from the `'genre'` column:

In [17]:
unique_genres = df['genre'].unique()
print(unique_genres)
# checking for implicit duplicates

['rock' 'pop' 'folk' 'dance' 'rusrap' 'ruspop' 'world' 'electronic'
 'unknown' 'alternative' 'children' 'rnb' 'hip' 'jazz' 'postrock' 'latin'
 'classical' 'metal' 'reggae' 'triphop' 'blues' 'instrumental' 'rusrock'
 'dnb' 'türk' 'post' 'country' 'psychedelic' 'conjazz' 'indie'
 'posthardcore' 'local' 'avantgarde' 'punk' 'videogame' 'techno' 'house'
 'christmas' 'melodic' 'caucasian' 'reggaeton' 'soundtrack' 'singer' 'ska'
 'salsa' 'ambient' 'film' 'western' 'rap' 'beats' "hard'n'heavy"
 'progmetal' 'minimal' 'tropical' 'contemporary' 'new' 'soul' 'holiday'
 'german' 'jpop' 'spiritual' 'urban' 'gospel' 'nujazz' 'folkmetal'
 'trance' 'miscellaneous' 'anime' 'hardcore' 'progressive' 'korean'
 'numetal' 'vocal' 'estrada' 'tango' 'loungeelectronic' 'classicmetal'
 'dubstep' 'club' 'deep' 'southern' 'black' 'folkrock' 'fitness' 'french'
 'disco' 'religious' 'hiphop' 'drum' 'extrememetal' 'türkçe'
 'experimental' 'easy' 'metalcore' 'modern' 'argentinetango' 'old' 'swing'
 'breaks' 'eurofolk' 

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>
We detected three issues with the data:

- Incorrect header styles
- Missing values
- Obvious and implicit duplicates

The headers have been cleaned up to make processing the table simpler.

All missing values have been replaced with `'unknown'`. But we still have to see whether the missing values in `'genre'` will affect our calculations.

The absence of duplicates will make the results more precise and easier to understand.

Now we can move on to testing hypotheses. 

[Back to Contents](#back)

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. Test this using the data on three days of the week: Monday, Wednesday, and Friday.

* Divide the users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday, and Friday.


For the sake of practice, perform each computation separately. 

Evaluate user activity in each city. Group the data by city and find the number of songs played in each group.



In [18]:
city_group = df.groupby('city')
city_track_counts = city_group['track'].sum()
print(city_track_counts) # Print the track counts for each city
# Counting up the tracks played in each city

city
Shelbyville    Kamigata To BootsFuniculì funiculàDragons in t...
Springfield    Delayed Because of AccidentSoul PeopleTrueFeel...
Name: track, dtype: object


<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 

Not quite right: `sum()` concatenates the track titles instead of counting them. Please use `groupby()` with `count()` in one line, for example:
    
    df.groupby(['city'])['track'].count()
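
As a sketch of the counting pattern on a made-up sample (the `sample` frame and its rows are assumptions for illustration), `count()` returns one number per group, and the same one-liner works for the `day` column:

```python
import pandas as pd

# Hypothetical sample with the two columns used for grouping
sample = pd.DataFrame({
    'city': ['Springfield', 'Shelbyville', 'Springfield', 'Springfield'],
    'day': ['Monday', 'Monday', 'Friday', 'Monday'],
    'track': ['True', 'Chains', 'Soul People', 'Pessimist'],
})

# Number of tracks played in each city
counts_by_city = sample.groupby('city')['track'].count()
print(counts_by_city)

# Number of tracks played on each day
counts_by_day = sample.groupby('day')['track'].count()
print(counts_by_day)
```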

Springfield has more tracks played than Shelbyville. But that does not imply that citizens of Springfield listen to music more often. This city is simply bigger, and there are more users.

Now group the data by day of the week and find the number of tracks played on Monday, Wednesday, and Friday.


In [19]:
day_group = df.groupby('day')
daily_track_counts = day_group['track'].sum()
print(daily_track_counts)
# Calculating tracks played on each of the three days

day
Friday       Delayed Because of AccidentChainsL’estateAfter...
Monday       Dragons in the SunsetSoul PeopleGool la MitaIs...
Wednesday    Kamigata To BootsFuniculì funiculàTrueFeeling ...
Name: track, dtype: object


<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 

The same applies here: please use `groupby()` with `count()` in one line.

Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

You have seen how grouping by city or day works. Now write a function that will group by both.

Create the `number_tracks()` function to calculate the number of songs played for a given day and city. It will require two parameters:
* day of the week
* name of the city

In the function, use a variable to store the rows from the original table, where:
  * `'day'` column value is equal to the `day` parameter
  * `'city'` column value is equal to the `city` parameter

Apply consecutive filtering with logical indexing.

Then calculate the `'user_id'` column values in the resulting table. Store the result to a new variable. Return this variable from the function.

In [20]:
# <creating the function number_tracks()>
def number_tracks(day, city): # a function with two parameters: day=, city=
    # Let track_list store the df rows where the value in the 'day' column
    # equals the day= parameter and, at the same time, the value in the
    # 'city' column equals the city= parameter (logical indexing).
    track_list = df.loc[(df['day'] == day) & (df['city'] == city)]
    # Let track_list_count store the number of 'user_id' values in track_list
    # (found with the count() method).
    track_list_count = track_list['user_id'].count()
    # Let the function return a number: the value of track_list_count.
    return track_list_count

# The function counts tracks played for a certain city and day.
# It retrieves the rows with the intended day and city from the table,
# then finds the number of 'user_id' values in the filtered table,
# then returns that number.
# To see what it returns, wrap the function call in print().


<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
Excellent function, works as required

Call `number_tracks()` six times, changing the parameter values, so that you retrieve the data on both cities for each of the three days.

In [21]:
number_tracks('Monday', 'Springfield')
# the number of songs played in Springfield on Monday

15740

In [22]:
number_tracks('Monday', 'Shelbyville') # the number of songs played in Shelbyville on Monday

5614

In [23]:
number_tracks('Wednesday', 'Springfield') # the number of songs played in Springfield on Wednesday

11056

In [24]:
number_tracks('Wednesday', 'Shelbyville') # the number of songs played in Shelbyville on Wednesday

7003

In [25]:
number_tracks('Friday', 'Springfield') # the number of songs played in Springfield on Friday

15945

In [26]:
number_tracks('Friday', 'Shelbyville') #the number of songs played in Shelbyville on Friday

5895

Use `pd.DataFrame` to create a table, where
* Column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data is the results you got from `number_tracks()`

In [27]:
# table with results
import pandas as pd
columns = ['City', 'Monday', 'Wednesday', 'Friday']
data = [['Springfield', 16715, 11755, 16890],
        ['Shelbyville', 5982, 0, 6259]]
df_number_tracks = pd.DataFrame(data=data, columns=columns)

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 

Please check the values above: they currently differ from the results returned by `number_tracks()`.
    
Also, please display the table with the results.
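
A minimal sketch of the table, filled with the six values shown in the `number_tracks()` outputs above and using the lowercase column names from the task:

```python
import pandas as pd

# Values copied from the number_tracks() outputs recorded in cells 21-26
columns = ['city', 'monday', 'wednesday', 'friday']
data = [
    ['Springfield', 15740, 11056, 15945],
    ['Shelbyville', 5614, 7003, 5895],
]
df_number_tracks = pd.DataFrame(data=data, columns=columns)

# Display the table
print(df_number_tracks)
```

In the notebook, calling `number_tracks()` inside the `data` list instead of typing the numbers would keep the table in sync with the function.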

**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays, while on Wednesday there is a decrease in activity.
- In Shelbyville, on the contrary, users listen to music more on Wednesday. User activity on Monday and Friday is smaller.

So the first hypothesis seems to be correct.

[Back to Contents](#back)

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday night, citizens of Springfield listen to genres that differ from the ones users from Shelbyville enjoy.

Get tables (make sure that the name of your combined table matches the DataFrame given in the two code blocks below):
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [28]:
# create the spr_general table from the df rows,
# where the value in the 'city' column is 'Springfield'
spr_general = df[df['city'] == 'Springfield']
print(spr_general.head())



    user_id                        track            artist   genre  \
1  55204538  Delayed Because of Accident  Andreas Rönnberg    rock   
4  E2DC1FAE                  Soul People        Space Echo   dance   
6  4CB90AA5                         True      Roman Messer   dance   
7  F03E1C1F             Feeling This Way   Polina Griffith   dance   
8  8FA1D3BE                     L’estate       Julia Dalia  ruspop   

          city      time        day  
1  Springfield  14:07:09     Friday  
4  Springfield  08:34:34     Monday  
6  Springfield  13:00:07  Wednesday  
7  Springfield  20:47:49  Wednesday  
8  Springfield  09:17:40     Friday  


In [29]:
# create the shel_general from the df rows,
# where the value in the 'city' column is 'Shelbyville'
shel_general = df[df['city'] == 'Shelbyville']
print(shel_general.head())


    user_id                  track            artist   genre         city  \
0  FFB692EC      Kamigata To Boots  The Mass Missile    rock  Shelbyville   
2    20EC38      Funiculì funiculà       Mario Lanza     pop  Shelbyville   
3  A3DD03C9  Dragons in the Sunset        Fire + Ice    folk  Shelbyville   
5  842029A1                 Chains          Obladaet  rusrap  Shelbyville   
9  E772D5C0              Pessimist           unknown   dance  Shelbyville   

       time        day  
0  20:28:33  Wednesday  
2  20:58:07  Wednesday  
3  08:37:09     Monday  
5  13:09:41     Friday  
9  21:20:49  Wednesday  


<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    

Well done - you created separate dataframes with Springfield and Shelbyville data.

Write the `genre_weekday()` function with four parameters:
* A table for data (`df`)
* The day of the week (`day`)
* The first timestamp, in 'hh:mm' format (`time1`)
* The last timestamp, in 'hh:mm' format (`time2`)

The function should return info on the 15 most popular genres on a given day within the period between the two timestamps.

In [30]:
# 1) Let the genre_df variable store the rows that meet several conditions:
#    - the value in the 'day' column is equal to the value of the day= argument
#    - the value in the 'time' column is greater than the value of the time1= argument
#    - the value in the 'time' column is smaller than the value of the time2= argument
#    Use consecutive filtering with logical indexing.

# 2) Group genre_df by the 'genre' column, take one of its columns, 
#    and use the count() method to find the number of entries for each of 
#    the represented genres; store the resulting Series to the
#    genre_df_count variable

# 3) Sort genre_df_count in descending order of frequency and store the result
#    to the genre_df_sorted variable

# 4) Return a Series object with the first 15 genre_df_sorted value - the 15 most
#    popular genres (on a given day, within a certain timeframe)

# Write your function here
def genre_weekday(df, day, time1, time2):
    
    # consecutive filtering
    # Create the variable genre_df which will store only those df rows where the day is equal to day=
    genre_df = df[df['day'] == day] # write your code here

    # filter again so that genre_df will store only those rows where the time is smaller than time2=
    genre_df = df[df['time'] < time2] # write your code here

    # filter once more so that genre_df will store only rows where the time is greater than time1=
    genre_df = df[df['time'] > time1] # write your code here

    # group the filtered DataFrame by the column with the names of genres, take the genre column, and find the number of rows for each genre with the count() method
    genre_df_count = df.groupby('genre')['genre'].count() # write your code here

    # sort the result in descending order (so that the most popular genres come first in the Series object)
    genre_df_sorted = genre_df_count.sort_values(ascending=False) # write your code here

    # we will return the Series object storing the 15 most popular genres on a given day in a given timeframe
    return genre_df_sorted[:15]

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 

The function is very good, but a few points need to be corrected:
    
* in the steps `filter again so...` and `filter once more...`, filter `genre_df` rather than `df`; otherwise each filter discards the previous one
    
* in the step `group the filtered DataFrame...`, group `genre_df` rather than `df`; otherwise the counts ignore the filters entirely
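
A minimal sketch of the function with these corrections applied, tested on a made-up sample (the `sample` frame and its rows are assumptions for illustration). The time comparison works on strings because the timestamps are zero-padded, so lexicographic order matches chronological order.

```python
import pandas as pd

def genre_weekday(df, day, time1, time2):
    # Each filter is applied to genre_df, so the conditions accumulate
    genre_df = df[df['day'] == day]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]

    # Count the rows per genre in the filtered table only
    genre_df_count = genre_df.groupby('genre')['genre'].count()

    # Most popular genres first
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted[:15]

# Hypothetical sample: three Monday-morning plays, one Monday-evening, one Friday
sample = pd.DataFrame({
    'genre': ['pop', 'pop', 'rock', 'dance', 'pop'],
    'time': ['08:34:34', '09:17:40', '08:37:09', '20:28:33', '10:00:00'],
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
})

top = genre_weekday(sample, 'Monday', '07:00', '11:00')
print(top)
```

With the filters chained this way, the four calls below should stop returning identical results for Monday and Friday.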

Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 7AM to 11AM) and on Friday evening (from 17:00 to 23:00):

In [31]:
genre_weekday(shel_general, 'Monday', '07:00', '11:00') # calling the function for Monday morning in Shelbyville (shel_general is passed instead of the df table)


genre
pop            2431
dance          1932
rock           1879
electronic     1736
hip             934
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
jazz            486
metal           378
soundtrack      331
rnb             321
rap             309
Name: genre, dtype: int64

In [411]:
genre_weekday(spr_general, 'Monday', '07:00', '11:00') # calling the function for Monday morning in Springfield (spr_general is passed instead of the df table)


genre
pop            5892
dance          4435
rock           3965
electronic     3786
hip            2041
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
jazz            980
unknown         849
metal           832
soundtrack      785
folk            692
Name: genre, dtype: int64

In [412]:
genre_weekday(spr_general, 'Friday', '17:00', '23:00')# calling the function for Friday evening in Springfield


genre
pop            5892
dance          4435
rock           3965
electronic     3786
hip            2041
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
jazz            980
unknown         849
metal           832
soundtrack      785
folk            692
Name: genre, dtype: int64

In [413]:
genre_weekday(shel_general, 'Friday', '17:00', '23:00')# calling the function for Friday evening in Shelbyville


genre
pop            2431
dance          1932
rock           1879
electronic     1736
hip             934
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
jazz            486
metal           378
soundtrack      331
rnb             321
rap             309
Name: genre, dtype: int64

**Conclusion**

Having compared the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The top five genres are the same, only rock and electronic have switched places.

2. In Springfield, the number of missing values turned out to be so big that the value `'unknown'` came in 10th. This means that missing values make up a considerable portion of the data, which may be a basis for questioning the reliability of our conclusions.

For Friday evening, the situation is similar. Individual genres vary somewhat, but on the whole, the top 15 is similar for the two cities.

Thus, the second hypothesis has been partially proven true:
* Users listen to similar music at the beginning and end of the week.
* There is no major difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affect our top 15. Were we not missing these values, things might look different.
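
The concern about missing values can be made concrete by measuring what share of plays carry the `'unknown'` placeholder. The snippet below is a minimal sketch on made-up data: `toy_spr` is a hypothetical stand-in for `spr_general`, and only the `genre` column is assumed.

```python
import pandas as pd

# Hypothetical stand-in for spr_general; the genres and their counts are invented.
toy_spr = pd.DataFrame({
    'genre': ['pop', 'unknown', 'rock', 'pop', 'unknown',
              'dance', 'pop', 'rock', 'pop', 'jazz'],
})

# A boolean mask averaged over all rows gives the fraction of rows equal to 'unknown'.
unknown_share = (toy_spr['genre'] == 'unknown').mean()

print(f"Share of plays with an unknown genre: {unknown_share:.1%}")
```

Running the same expression on the real table for each city would show whether the gap is large enough to reorder the top 15.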

[Back to Contents](#back)

### Hypothesis 3: genre preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield's citizens are more into pop.

Group the `spr_general` table by genre and count the songs played in each genre with the `count()` method. Then sort the result in descending order and store it in `spr_genres`.
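
As an illustration of this step, here is a minimal sketch on toy data. The `toy_plays` table and its values are assumptions made for the example; only the `track` and `genre` column names come from the project.

```python
import pandas as pd

# Toy stand-in for spr_general; the real table has more columns and rows.
toy_plays = pd.DataFrame({
    'track': ['a', 'b', 'c', 'd', 'e', 'f'],
    'genre': ['pop', 'rock', 'pop', 'dance', 'pop', 'rock'],
})

# Group by genre and count the tracks in each group.
# The result is a Series indexed by genre.
toy_counts = toy_plays.groupby('genre')['track'].count()

# Sort the Series itself in descending order.
# Series.sort_values() takes no `by` argument because there is only one column of values.
toy_genres = toy_counts.sort_values(ascending=False)

print(toy_genres.head(10))
```

The key point is that the sort is applied to the grouped Series, not to the original table. Sorting the original DataFrame by `'genre'` only reorders its rows alphabetically and leaves the counts unused.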

In [414]:
spr_gerneral = spr_general.groupby('genre')['track'].count() # on one line: group the spr_general table by the 'genre' column, 
# count the 'genre' values with count() in the grouping, 
spr_genres = spr_general.sort_values(by = 'genre', ascending = False) # sort the resulting Series in descending order, and store it to spr_genres

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 

Please do the sorting on the second line and apply it to the grouped Series, like:
    
    spr_genres = spr_gerneral.sort_values(ascending=False)

</div>

Print the first 10 rows from `spr_genres`:

In [415]:
print(spr_genres.head(10))# printing the first 10 rows of spr_genres

        user_id                    track            artist      genre  \
8514   A439123F                  Flip It           unknown        ïîï   
28665  D7FB50DA          Drumming Circle  Professor Trance  worldbeat   
15337  F1AFA4BA             Mae Pa Fidje   Teofilo Chantre      world   
26052  6B799FBC                      Eba           Tamaley      world   
44838  54F93A25                Franc Jeu      Medhy Custos      world   
44811  4C122665  Masha Allah Allah Allah          DJ Nabil      world   
5920   382BE6A7                 Yerazoum  Mihran Tsarukyan      world   
58178  12B1B81D              Nós e o Mar        Tamba Trio      world   
18897  9097998C              Sandy Beach            Broque      world   
5925   7FB00AC8                  '74 '75           The Luc      world   

              city      time        day  
8514   Springfield  09:08:51     Friday  
28665  Springfield  09:30:47     Monday  
15337  Springfield  14:45:29     Friday  
26052  Springfield  21:45:47

Now do the same with the data on Shelbyville.

Group the `shel_general` table by genre and count the songs played in each genre. Then sort the result in descending order and store it in `shel_genres`:


In [416]:
shel_gerneral = shel_general.groupby('genre')['track'].count() # on one line: group the shel_general table by the 'genre' column, 
# count the 'genre' values in the grouping with count(), 
shel_genres = shel_general.sort_values(by = 'genre', ascending = False)# sort the resulting Series in descending order and store it to shel_genres

<div class="alert alert-danger"; style="border-left: 7px solid red">
<b>⛔️ Reviewer's comment, v. 1</b> 

Here again, please sort the grouped Series `shel_gerneral`, like:
    
    shel_genres = shel_gerneral.sort_values(ascending=False)

</div>

Print the first 10 rows of `shel_genres`:

In [417]:
print(shel_genres.head(10))# printing the first 10 rows from shel_genres

        user_id                track                     artist      genre  \
40493  AA1730E8                  Anu          Kailash Kokopelli  worldbeat   
31420  4F773441                Tibet                 Furunkulus      world   
13204  BA0FE104                 Ride            Elji Beatzkilla      world   
39937  38EF7CEB       El Forga Moura                       Haim      world   
20014  58EA4E8A  King Of The Fairies                Tom McHaile      world   
39996  2297CBFB          Minor Swing  International String Trio      world   
1229    47E5088       Massachussetts              Cover Masters      world   
13201  F0F48477          Mustt Mustt                    unknown      world   
5412   1FE0862E      Sakura Shamisen                     Kensei      world   
61034   CAB3F3A                Dovdu                    Angelit      world   

              city      time        day  
40493  Shelbyville  20:16:34  Wednesday  
31420  Shelbyville  21:07:29  Wednesday  
13204  Shelbyvi

**Conclusion**

The hypothesis has been partially proven true:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be equally popular in Springfield and Shelbyville, and rap wasn't in the top 5 for either city.
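
Because the two cities have different numbers of plays, raw counts are hard to compare directly. One possible way to check the claim, sketched here with invented numbers, is to convert each city's genre counts to shares and place them side by side. The names `spr_toy` and `shel_toy` are hypothetical stand-ins for the sorted genre counts of each city.

```python
import pandas as pd

# Invented genre counts standing in for the per-city results.
spr_toy = pd.Series({'pop': 50, 'dance': 30, 'rock': 15, 'rap': 5})
shel_toy = pd.Series({'pop': 20, 'dance': 12, 'rock': 6, 'rap': 2})

# Divide by each city's total so the columns are comparable,
# then align the two Series on genre as columns of one table.
comparison = pd.concat(
    [spr_toy / spr_toy.sum(), shel_toy / shel_toy.sum()],
    axis=1,
    keys=['springfield_share', 'shelbyville_share'],
)

print(comparison)
```

With shares instead of counts, a genre that is genuinely more popular in one city stands out even when that city has far fewer listeners overall.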


[Back to Contents](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

After analyzing the data, we concluded:

1. User activity in Springfield and Shelbyville depends on the day of the week, though the pattern differs between the two cities. 

The first hypothesis is fully accepted.

2. Musical preferences do not vary significantly over the course of the week in either Springfield or Shelbyville. We can see small differences in the order of genres on Mondays, but in both cities people listen to pop music most.

So we can't accept this hypothesis. We must also keep in mind that the result could have been different if not for the missing values.

3. It turns out that the musical preferences of users from Springfield and Shelbyville are quite similar.

The third hypothesis is rejected. If there is any difference in preferences, it cannot be seen from this data.

### Note 
In real projects, research involves statistical hypothesis testing, which is more precise and more quantitative. Also note that you cannot always draw conclusions about an entire city based on the data from just one source.
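
To give a feel for what such a test might look like, the sketch below computes a chi-square statistic for independence between city and genre by hand. It is only an illustration under stated assumptions: the counts in `observed` are invented, and the choice of a chi-square test is ours, not something prescribed by the project.

```python
import pandas as pd

# Invented play counts per genre and city (not taken from the project data).
observed = pd.DataFrame(
    {'springfield': [50, 30, 20], 'shelbyville': [20, 15, 15]},
    index=['pop', 'dance', 'rock'],
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
total = observed.to_numpy().sum()

# Expected counts if genre preference did not depend on the city:
# (row total * column total) / grand total for every cell.
expected = pd.DataFrame(
    row_totals[:, None] * col_totals[None, :] / total,
    index=observed.index,
    columns=observed.columns,
)

# Chi-square statistic: squared deviations from the expected counts, scaled by them.
chi2_stat = ((observed - expected) ** 2 / expected).to_numpy().sum()

# Degrees of freedom for a contingency table: (rows - 1) * (columns - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"chi-square = {chi2_stat:.3f}, degrees of freedom = {dof}")
```

The statistic is then compared with a critical value of the chi-square distribution for the given degrees of freedom (5.991 at the 5% level for 2 degrees of freedom). A statistic below the critical value means the data give no grounds to say the cities differ.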

You will study hypothesis testing in the sprint on statistical data analysis.

<div class="alert alert-success"; style="border-left: 7px solid green">
<b>✅ Reviewer's comment, v. 1</b> 
    
The overall conclusion is an important part of the project: it should summarise the project's outcomes.

</div>

[Back to Contents](#back)