# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [1]:
import pandas as pd

import numpy as np

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [2]:
baby_names = pd.read_csv('https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv', sep=',')
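The `Unnamed: 0` column seen later usually means a CSV was saved together with its row index. A minimal sketch, assuming a small in-memory CSV (`csv_text` is hypothetical data that only mimics the layout, not the real file), of how `index_col=0` would absorb such a column at load time:

```python
import io
import pandas as pd

# Hypothetical three-row CSV: the first, unnamed column is an index
# that was written out when the file was saved.
csv_text = """,Id,Name,Year,Gender,State,Count
11349,11350,Emma,2004,F,AK,62
11350,11351,Madison,2004,F,AK,48
11351,11352,Hannah,2004,F,AK,46
"""

# Default load: the unnamed column shows up as 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(indexed.columns))
```

Whether this is preferable to dropping the column afterwards depends on whether the old index is still useful.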

### Step 4. See the first 10 entries

In [3]:
# head() returns 5 rows by default; use head(10) to see the first 10 entries
baby_names.head()

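| | Unnamed: 0 | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|---|
| 0 | 11349 | 11350 | Emma | 2004 | F | AK | 62 |
| 1 | 11350 | 11351 | Madison | 2004 | F | AK | 48 |
| 2 | 11351 | 11352 | Hannah | 2004 | F | AK | 46 |
| 3 | 11352 | 11353 | Grace | 2004 | F | AK | 44 |
| 4 | 11353 | 11354 | Emily | 2004 | F | AK | 41 |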


In [4]:
baby_names.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1016395 entries, 0 to 1016394
Data columns (total 7 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Unnamed: 0  1016395 non-null  int64 
 1   Id          1016395 non-null  int64 
 2   Name        1016395 non-null  object
 3   Year        1016395 non-null  int64 
 4   Gender      1016395 non-null  object
 5   State       1016395 non-null  object
 6   Count       1016395 non-null  int64 
dtypes: int64(4), object(3)
memory usage: 54.3+ MB


In [5]:
baby_names.shape

(1016395, 7)

In [16]:
# number of non-null entries in the Name column (rows, not distinct names)

baby_names.Name.count()

1016395

In [15]:
# how many unique names

baby_names.Name.nunique()

17632

In [18]:
# number of rows in which each name appears

baby_names.Name.value_counts()

Riley       1112
Avery       1080
Jordan      1073
Peyton      1064
Hayden      1049
            ... 
Nyanza         1
Jonhatan       1
Juliani        1
Janard         1
Janari         1
Name: Name, Length: 17632, dtype: int64

In [20]:
# we can check how many rows a specific name appears in

(baby_names['Name'].values == 'Emily').sum()

565
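The row counts above say how many records mention a name, not how many babies received it. A minimal sketch on hypothetical toy data (the frame `toy` and its values are assumptions, not taken from the file) showing both quantities for one name:

```python
import pandas as pd

# Hypothetical toy frame; each row is one record for a name.
toy = pd.DataFrame({
    "Name": ["Emily", "Emma", "Emily", "Jacob", "Emily"],
    "Count": [41, 62, 12, 30, 7],
})

# Rows per name: how many records mention each name.
rows_per_name = toy["Name"].value_counts()

# Look up a single name in that result.
emily_rows = rows_per_name["Emily"]

# Babies per name: sum the Count column, which is a different quantity.
emily_babies = toy.loc[toy["Name"] == "Emily", "Count"].sum()

print(emily_rows, emily_babies)
```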

### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [6]:
baby_names = baby_names.drop(columns=['Unnamed: 0', 'Id'])

baby_names

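| | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|
| 0 | Emma | 2004 | F | AK | 62 |
| 1 | Madison | 2004 | F | AK | 48 |
| 2 | Hannah | 2004 | F | AK | 46 |
| 3 | Grace | 2004 | F | AK | 44 |
| 4 | Emily | 2004 | F | AK | 41 |
| ... | ... | ... | ... | ... | ... |
| 1016390 | Seth | 2014 | M | WY | 5 |
| 1016391 | Spencer | 2014 | M | WY | 5 |
| 1016392 | Tyce | 2014 | M | WY | 5 |
| 1016393 | Victor | 2014 | M | WY | 5 |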


### Step 6. Are there more male or female names in the dataset?

In [7]:
#different ways to calculate 

In [8]:
# baby_names.Gender == 'M'  # returns boolean value

baby_names.Gender.value_counts()

F    558846
M    457549
Name: Gender, dtype: int64

In [9]:
(baby_names['Gender'].values == 'M').sum()

457549

In [10]:
(baby_names['Gender'].values == 'F').sum()

558846

In [11]:
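# value_counts() on a DataFrame counts distinct rows; the "Length" line gives the number of male rows
baby_names[baby_names.Gender == 'M'].value_counts()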

Name    Year  Gender  State  Count
Aaban   2013  M       NY     6        1
Lukas   2009  M       LA     10       1
                      NJ     43       1
                      NE     8        1
                      NC     43       1
                                     ..
Emmett  2006  M       AR     5        1
                      AL     6        1
        2005  M       WI     10       1
                      WA     15       1
Zyren   2013  M       TX     6        1
Length: 457549, dtype: int64

In [12]:
# same approach for female rows; again the "Length" line carries the answer
baby_names[baby_names.Gender == 'F'].value_counts()

Name     Year  Gender  State  Count
Aadhya   2007  F       CA     5        1
Lucinda  2008  F       NY     16       1
                       GA     7        1
                       IL     6        1
                       IN     6        1
                                      ..
Emme     2012  F       PA     6        1
                       OH     7        1
                       NY     15       1
                       NJ     5        1
Zyriah   2014  F       TX     6        1
Length: 558846, dtype: int64
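All of the counts above are row counts, and each row is one name/year/state record. A minimal sketch on hypothetical toy data (the frame `toy` is an assumption, not the real file) of two other readings of the question, distinct names per gender and babies per gender:

```python
import pandas as pd

# Hypothetical toy frame with one name ("Riley") used for both genders.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Riley", "Riley", "Jacob"],
    "Gender": ["F", "F", "F", "M", "M"],
    "Count": [62, 48, 10, 12, 30],
})

# Rows per gender (what value_counts() measures).
rows_by_gender = toy["Gender"].value_counts()

# Distinct names per gender.
unique_names_by_gender = toy.groupby("Gender")["Name"].nunique()

# Babies per gender: sum of the Count column.
babies_by_gender = toy.groupby("Gender")["Count"].sum()

print(rows_by_gender, unique_names_by_gender, babies_by_gender, sep="\n")
```

The three measures can rank the genders differently, so it helps to state which one an answer refers to.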

### Step 7. Group the dataset by name and assign to names

In [28]:
# drop the Year column so that years are not summed in the groupby below

names = baby_names.drop(['Year'], axis=1)

names

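| | Name | Gender | State | Count |
|---|---|---|---|---|
| 0 | Emma | F | AK | 62 |
| 1 | Madison | F | AK | 48 |
| 2 | Hannah | F | AK | 46 |
| 3 | Grace | F | AK | 44 |
| 4 | Emily | F | AK | 41 |
| ... | ... | ... | ... | ... |
| 1016390 | Seth | M | WY | 5 |
| 1016391 | Spencer | M | WY | 5 |
| 1016392 | Tyce | M | WY | 5 |
| 1016393 | Victor | M | WY | 5 |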


In [33]:
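# sum only the Count column; selecting it explicitly avoids relying on pandas
# dropping the non-numeric columns, which newer versions no longer do by default

names = names.groupby('Name')[['Count']].sum()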

names

| Name | Count |
|---|---|
| Aaban | 12 |
| Aadan | 23 |
| Aadarsh | 5 |
| Aaden | 3426 |
| Aadhav | 6 |
| ... | ... |
| Zyra | 42 |
| Zyrah | 11 |
| Zyren | 6 |
| Zyria | 59 |


In [35]:
names.shape

(17632, 1)
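To make the aggregation concrete, here is a minimal sketch on hypothetical toy data (the frame `toy` is an assumption). It groups by `Name` and sums only `Count`, so the string columns play no part in the result:

```python
import pandas as pd

# Hypothetical toy frame: "Emma" appears in two states.
toy = pd.DataFrame({
    "Name": ["Emma", "Emma", "Jacob"],
    "Gender": ["F", "F", "M"],
    "State": ["AK", "AL", "AK"],
    "Count": [62, 48, 30],
})

# Double brackets keep the result a DataFrame with a single Count column.
names_toy = toy.groupby("Name")[["Count"]].sum()

print(names_toy)
```

The grouped frame has one row per distinct name, with `Name` as its index.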

### Step 8. How many different names exist in the dataset?

In [34]:
baby_names.Name.nunique()

17632

In [36]:
# the names DataFrame is already grouped by Name, so it holds one row per unique name
len(names)

17632

### Step 9. What is the name with most occurrences?

In [43]:
# idxmax() returns the index label (here, the name) of the maximum value

names.Count.idxmax()

'Jacob'

In [48]:
names.sort_values(by='Count', ascending=False)

| Name | Count |
|---|---|
| Jacob | 242874 |
| Emma | 214852 |
| Michael | 214405 |
| Ethan | 209277 |
| Isabella | 204798 |
| ... | ... |
| Eniola | 5 |
| Atlantis | 5 |
| Marci | 5 |
| Simarpreet | 5 |


In [76]:
names.Count.max()

242874
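When only the top few names matter, sorting the whole frame is more work than needed. A minimal sketch, assuming a hypothetical grouped frame `names_toy` shaped like `names`, that uses `idxmax()` for the single top name and `nlargest()` for the top few:

```python
import pandas as pd

# Hypothetical grouped frame: index is Name, one Count column.
names_toy = pd.DataFrame(
    {"Count": [110, 30, 75]},
    index=pd.Index(["Emma", "Jacob", "Ethan"], name="Name"),
)

# Label of the row with the largest Count.
top_name = names_toy["Count"].idxmax()

# The two largest counts, already in descending order.
top_two = names_toy["Count"].nlargest(2)

print(top_name)
print(top_two)
```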

### Step 10. How many different names have the least occurrences?

In [None]:
# there are several ways to do this

In [101]:
# count how many names share each total; the smallest total is the least occurrence

names['Count'].value_counts()

5        2578
6         751
10        700
11        474
12        279
         ... 
86156       1
8338        1
2199        1
22685       1
79849       1
Name: Count, Length: 3279, dtype: int64

In [119]:
# from the output above, 5 is the smallest total count

(names['Count'].values == 5).sum()

2578

In [120]:
# DataFrame of the names with the fewest occurrences

names[names['Count'] == names.Count.min()]

| Name | Count |
|---|---|
| Aadarsh | 5 |
| Aadin | 5 |
| Aaima | 5 |
| Aalaya | 5 |
| Aaminah | 5 |
| ... | ... |
| Zyien | 5 |
| Zyire | 5 |
| Zykeriah | 5 |
| Zykierra | 5 |
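The count of least-frequent names can also be computed without reading the minimum off a printed table. A minimal sketch on a hypothetical Series `counts` (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical per-name totals.
counts = pd.Series(
    [5, 5, 12, 5, 40],
    index=pd.Index(["Aadarsh", "Aadin", "Aaden", "Aaima", "Zyra"], name="Name"),
    name="Count",
)

# Smallest total, computed rather than hard-coded.
least = counts.min()

# Number of names that share that smallest total.
n_least = (counts == least).sum()

print(least, n_least)
```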


### Step 11. What is the median name occurrence?

In [83]:
names.median()

Count    49.0
dtype: float64

In [104]:
(names['Count'].values == 49).sum()

66

In [118]:
# show the names whose total count equals the median (49)

names[names['Count'] == names.Count.median()]

| Name | Count |
|---|---|
| Aishani | 49 |
| Alara | 49 |
| Alysse | 49 |
| Ameir | 49 |
| Anely | 49 |
| ... | ... |
| Sriram | 49 |
| Trinton | 49 |
| Vita | 49 |
| Yoni | 49 |
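Filtering on equality with the median works here because 49 is an actual value in the data. That is not guaranteed in general: with an even number of values the median is the mean of the two middle ones. A minimal sketch on a hypothetical Series `counts`:

```python
import pandas as pd

# Hypothetical totals with an even number of entries.
counts = pd.Series([5, 6, 10, 49], name="Count")

# Median of an even-length series: mean of the two middle values (6 and 10).
med = counts.median()

# No entry equals the median, so the equality filter comes back empty.
matches = counts[counts == med]

print(med, len(matches))
```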


### Step 12. What is the standard deviation of name occurrences?

In [121]:
names.std()

Count    11006.069468
dtype: float64
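Note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population form (`ddof=0`). A minimal sketch on hypothetical values (the Series `counts` is an assumption chosen so the arithmetic is easy to check):

```python
import numpy as np
import pandas as pd

# Hypothetical values: mean is 5, squared deviations sum to 32.
counts = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# pandas default: divide by n - 1 (sample standard deviation).
sample_std = counts.std()

# Population standard deviation: divide by n.
population_std = counts.std(ddof=0)

# NumPy default matches the population form.
numpy_std = np.std(counts.to_numpy())

print(sample_std, population_std, numpy_std)
```

On a dataset with thousands of names the two differ only slightly, but the distinction matters when comparing results across libraries.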

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [122]:
names.describe()

| | Count |
|---|---|
| count | 17632.0 |
| mean | 2008.932169 |
| std | 11006.069468 |
| min | 5.0 |
| 25% | 11.0 |
| 50% | 49.0 |
| 75% | 337.0 |
| max | 242874.0 |
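The mean (about 2009) is far above the median (49), so the distribution is strongly right-skewed, and extra percentiles can help describe the tail. A minimal sketch, assuming a hypothetical Series `counts`, of passing custom `percentiles` to `describe()`:

```python
import pandas as pd

# Hypothetical skewed totals.
counts = pd.Series([5, 11, 49, 337, 242874], name="Count")

# Request the 10th, 50th and 90th percentiles instead of the default quartiles.
summary = counts.describe(percentiles=[0.1, 0.5, 0.9])

print(summary)
```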
