# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [41]:
import numpy as np
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names (shortened to bn in the cells below).

In [42]:
bn = pd.read_csv("US_Baby_Names_right.csv")
bn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Unnamed: 0  1016395 non-null  int64 
 1   Id          1016395 non-null  int64 
 2   Name        1016395 non-null  object
 3   Year        1016395 non-null  int64 
 4   Gender      1016395 non-null  object
 5   State       1016395 non-null  object
 6   Count       1016395 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 54.3+ MB
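Steps 2 and 3 can also be done without a local copy, since `pd.read_csv` accepts a URL as well as a file path. The sketch below is only an illustration: it parses a tiny in-memory sample (the first rows shown in this notebook) rather than downloading the full file, and it uses `usecols` to skip the two bookkeeping columns at load time. The names `sample_csv` and `wanted` are made up for the example, and the blank first header in the sample is an assumption about the raw file.

```python
import io
import pandas as pd

# Tiny stand-in for the real file: the first rows shown in this notebook.
# Assumption: the raw CSV has a blank first header, which pandas names "Unnamed: 0".
sample_csv = io.StringIO(
    ",Id,Name,Year,Gender,State,Count\n"
    "11349,11350,Emma,2004,F,AK,62\n"
    "11350,11351,Madison,2004,F,AK,48\n"
    "11351,11352,Hannah,2004,F,AK,46\n"
)

# Keep only the columns the exercise uses; "Unnamed: 0" and "Id" are never loaded.
wanted = ["Name", "Year", "Gender", "State", "Count"]
baby_names = pd.read_csv(sample_csv, usecols=wanted)

print(baby_names.columns.tolist())
```

Passing the address from Step 2 in place of `sample_csv` should load the full dataset the same way.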


In [43]:
bn

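| | Unnamed: 0 | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|---|
| 0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1016390 | 5647421 | 5647422 | Seth | 2014 | M | WY | 5 |
| 1016391 | 5647422 | 5647423 | Spencer | 2014 | M | WY | 5 |
| 1016392 | 5647423 | 5647424 | Tyce | 2014 | M | WY | 5 |
| 1016393 | 5647424 | 5647425 | Victor | 2014 | M | WY | 5 |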


### Step 4. See the first 10 entries

In [44]:
bn.head(10)


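| | Unnamed: 0 | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|---|
| 0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |
| 5 | 11354 | 11355 | Abigail | 2004 | F | AK | 37 |
| 6 | 11355 | 11356 | Olivia | 2004 | F | AK | 33 |
| 7 | 11356 | 11357 | Isabella | 2004 | F | AK | 30 |
| 8 | 11357 | 11358 | Alyssa | 2004 | F | AK | 29 |
| 9 | 11358 | 11359 | Sophia | 2004 | F | AK | 28 |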


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [45]:
bn.drop(columns=['Id','Unnamed: 0'], inplace=True)


In [46]:
bn.head()

| | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|
| 0 | Emma | 2004 | F | AK | 62 |
| 1 | Madison | 2004 | F | AK | 48 |
| 2 | Hannah | 2004 | F | AK | 46 |
| 3 | Grace | 2004 | F | AK | 44 |
| 4 | Emily | 2004 | F | AK | 41 |


### Step 6. Are there more male or female names in the dataset?

In [47]:
bn.Gender.value_counts()

Gender
F    558846
M    457549
Name: count, dtype: int64
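Note that `value_counts()` on `Gender` counts rows (one per name, year and state combination), which is not the same as counting distinct names or babies. The sketch below shows the three readings side by side on a made-up miniature frame; `toy` and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up miniature of the dataset, for illustration only.
toy = pd.DataFrame({
    "Name":   ["Emma", "Emma", "Grace", "Seth", "Seth", "Seth"],
    "Gender": ["F", "F", "F", "M", "M", "M"],
    "State":  ["AK", "WY", "AK", "AK", "WY", "TX"],
    "Count":  [62, 10, 44, 5, 5, 7],
})

rows_per_gender = toy["Gender"].value_counts()               # rows, as in the cell above
names_per_gender = toy.groupby("Gender")["Name"].nunique()   # distinct names
babies_per_gender = toy.groupby("Gender")["Count"].sum()     # total births

print(rows_per_gender, names_per_gender, babies_per_gender, sep="\n\n")
```

Which of the three answers the question depends on how "more names" is read; the notebook's cell uses the row count.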

### Step 7. Group the dataset by name and assign to names

In [56]:
bn.Name

0             Emma
1          Madison
2           Hannah
3            Grace
4            Emily
            ...   
1016390       Seth
1016391    Spencer
1016392       Tyce
1016393     Victor
1016394     Waylon
Name: Name, Length: 1016395, dtype: object

In [55]:
bn['Name']

0             Emma
1          Madison
2           Hannah
3            Grace
4            Emily
            ...   
1016390       Seth
1016391    Spencer
1016392       Tyce
1016393     Victor
1016394     Waylon
Name: Name, Length: 1016395, dtype: object

In [54]:
bn[['Name']]

| | Name |
|---|---|
| 0 | Emma |
| 1 | Madison |
| 2 | Hannah |
| 3 | Grace |
| 4 | Emily |
| ... | ... |
| 1016390 | Seth |
| 1016391 | Spencer |
| 1016392 | Tyce |
| 1016393 | Victor |
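Step 7 asks for a grouping, whereas the cells above only select the `Name` column (`bn.Name` and `bn['Name']` return a Series, `bn[['Name']]` a one-column DataFrame). A minimal sketch of the grouping itself, on a made-up frame called `toy`, might look like this:

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Name":  ["Emma", "Grace", "Emma", "Seth"],
    "State": ["AK", "AK", "WY", "WY"],
    "Count": [62, 44, 10, 5],
})

# groupby alone is lazy: it returns a grouping object, not a table.
grouped = toy.groupby("Name")

# Aggregating the Count column gives one total per name (index sorted by name).
names = grouped["Count"].sum()
print(names)
```

Summing `Count` is one reasonable aggregation here, because each row already carries a per-state, per-year tally.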


### Step 8. How many different names exist in the dataset?

In [57]:
bn[['Name']].nunique()

Name    17632
dtype: int64

In [None]:
bn.groupby('Name')

### Step 9. What is the name with most occurrences?

In [61]:
del bn['Year']

In [62]:
bn.head()

| | Name | Gender | State | Count |
|---|---|---|---|---|
| 0 | Emma | F | AK | 62 |
| 1 | Madison | F | AK | 48 |
| 2 | Hannah | F | AK | 46 |
| 3 | Grace | F | AK | 44 |
| 4 | Emily | F | AK | 41 |


In [66]:
bn.groupby('Name')['Count'].sum().sort_values(ascending=False)

Name
Jacob         242874
Emma          214852
Michael       214405
Ethan         209277
Isabella      204798
               ...  
Eniola             5
Atlantis           5
Marci              5
Simarpreet         5
Nita               5
Name: Count, Length: 17632, dtype: int64

In [68]:
bn.groupby('Name')['Count'].sum().idxmax()

'Jacob'

In [None]:
totals = bn.groupby('Name')['Count'].sum()
totals[totals == totals.max()]

In [69]:
bn.groupby('Name').Count.sum()

Name
Aaban        12
Aadan        23
Aadarsh       5
Aaden      3426
Aadhav        6
           ... 
Zyra         42
Zyrah        11
Zyren         6
Zyria        59
Zyriah       58
Name: Count, Length: 17632, dtype: int64

### Step 10. How many different names have the least occurrences?

In [67]:
bn.groupby('Name')['Count'].sum().sort_values(ascending=False).tail()

Name
Eniola        5
Atlantis      5
Marci         5
Simarpreet    5
Nita          5
Name: Count, dtype: int64
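The `tail()` above shows that the smallest total is 5, but it does not say how many names share it. One way to count them is to compare every total against the minimum and sum the resulting boolean mask. The sketch below uses a small hypothetical Series in place of the real per-name totals, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical per-name totals, standing in for bn.groupby('Name')['Count'].sum().
names = pd.Series(
    {"Emma": 72, "Grace": 44, "Nita": 5, "Marci": 5, "Seth": 9}, name="Count"
)

least = names.min()              # smallest total
is_least = names == least        # boolean mask, True where the total equals the minimum
n_least = int(is_least.sum())    # True counts as 1

print(least, n_least)
```

Applied to the real totals, `n_least` would be the answer to this step.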

### Step 11. What is the median name occurrence?

In [70]:
names = bn.groupby('Name').Count.sum()

In [71]:
names.idxmax()

'Jacob'

In [73]:
names.loc[names.idxmax()]

np.int64(242874)

In [74]:
names.median()

np.float64(49.0)

In [76]:
mask = names == names.median()
any(mask)

True

In [79]:
names[mask]

Name
Aishani    49
Alara      49
Alysse     49
Ameir      49
Anely      49
           ..
Sriram     49
Trinton    49
Vita       49
Yoni       49
Zuleima    49
Name: Count, Length: 66, dtype: int64

### Step 12. What is the standard deviation of name occurrences?

In [80]:
names.std()

np.float64(11006.06946789057)
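It helps to know which standard deviation this is: `Series.std()` computes the sample standard deviation (dividing by n - 1, `ddof=1`) by default, while NumPy's `np.std` defaults to the population version (`ddof=0`). A short sketch on made-up values shows the difference:

```python
import numpy as np
import pandas as pd

values = pd.Series([5, 5, 9, 44, 72])  # made-up totals, for illustration only

sample_std = values.std()                # pandas default: ddof=1
population_std = values.std(ddof=0)      # divide by n instead of n - 1
numpy_std = np.std(values.to_numpy())    # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

With more than 17,000 names the two versions differ very little, but the distinction matters on small samples.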

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [82]:
names.describe().to_frame()

| | Count |
|---|---|
| count | 17632.0 |
| mean | 2008.932169 |
| std | 11006.069468 |
| min | 5.0 |
| 25% | 11.0 |
| 50% | 49.0 |
| 75% | 337.0 |
| max | 242874.0 |
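If different percentiles or only a few of these statistics are needed, `describe()` takes a `percentiles` argument and `agg()` accepts a list of statistic names. The sketch below uses made-up values rather than the real totals:

```python
import pandas as pd

values = pd.Series([5, 5, 9, 44, 72], name="Count")  # made-up totals

# Ask describe() for the 10th, 50th and 90th percentiles instead of the quartiles.
summary = values.describe(percentiles=[0.1, 0.5, 0.9])

# agg() returns only the statistics that are named.
picked = values.agg(["mean", "min", "max", "std"])

print(summary, picked, sep="\n\n")
```

Given how skewed the totals are (median 49 against a mean of about 2009), high percentiles such as 0.9 or 0.99 may be more informative than the quartiles alone.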
