
# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The exercise file contains names from 2004 to 2014. The outputs below, however, come from a local `StateNames.csv` whose records start in 1910.


### Step 1. Import the necessary libraries

In [1]:
import pandas as pd

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv).

### Step 3. Assign it to a variable called baby_names.

In [2]:
# Reads a local copy uploaded to Colab rather than the URL given in Step 2
baby_names = pd.read_csv('/content/StateNames.csv')
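
Step 2 points at a remote CSV, while the cell above reads a local path. The sketch below shows one way to load either source through the same call. `load_baby_names` is a hypothetical helper, and the two-row in-memory sample (values copied from the Step 4 output) stands in for the real file so the snippet runs offline.

```python
import io
import pandas as pd

# URL taken from Step 2; it is not fetched here so the sketch runs offline.
URL = ("https://raw.githubusercontent.com/guipsamora/pandas_exercises/"
       "master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv")

def load_baby_names(source):
    """Hypothetical helper: read a CSV from a path, URL, or file-like object."""
    return pd.read_csv(source)

# Tiny stand-in with the columns shown in Step 4.
sample_csv = io.StringIO(
    "Id,Name,Year,Gender,State,Count\n"
    "1,Mary,1910,F,AK,14\n"
    "2,Annie,1910,F,AK,12\n"
)
sample = load_baby_names(sample_csv)
print(sample.shape)
```

With network access, `load_baby_names(URL)` would read the exercise file instead of the local copy.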

### Step 4. See the first 10 entries

In [3]:
print(baby_names.head(10))


   Id      Name  Year Gender State  Count
0   1      Mary  1910      F    AK   14.0
1   2     Annie  1910      F    AK   12.0
2   3      Anna  1910      F    AK   10.0
3   4  Margaret  1910      F    AK    8.0
4   5     Helen  1910      F    AK    7.0
5   6     Elsie  1910      F    AK    6.0
6   7      Lucy  1910      F    AK    6.0
7   8   Dorothy  1910      F    AK    5.0
8   9      Mary  1911      F    AK   12.0
9  10  Margaret  1911      F    AK    7.0


### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [4]:
baby_names = baby_names.drop(columns=['Unnamed: 0', 'Id'], errors='ignore')
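
The Step 4 output shows an `Id` column but no `Unnamed: 0`, which is presumably why `errors='ignore'` is passed. A minimal sketch on a made-up frame of what that argument changes:

```python
import pandas as pd

# Made-up frame that, like the Step 4 output, has 'Id' but no 'Unnamed: 0'.
toy = pd.DataFrame({"Id": [1, 2], "Name": ["Mary", "Annie"], "Count": [14, 12]})

# errors='ignore' drops the labels that exist and silently skips the rest.
cleaned = toy.drop(columns=["Unnamed: 0", "Id"], errors="ignore")
print(list(cleaned.columns))

# With the default errors='raise', a missing label raises KeyError.
try:
    toy.drop(columns=["Unnamed: 0", "Id"])
    raised = False
except KeyError:
    raised = True
print(raised)
```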


### Step 6. Are there more male or female names in the dataset?

In [5]:
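# value_counts() counts rows (one per name, year, state and gender), not babies
gender_count = baby_names['Gender'].value_counts()
male_count = gender_count.get('M', 0)
female_count = gender_count.get('F', 0)
print(f"Male names: {male_count}")
print(f"Female names: {female_count}")

if male_count > female_count:
    print("There are more male names in the dataset.")
elif female_count > male_count:
    print("There are more female names in the dataset.")
else:
    print("Male and female names are equally frequent in the dataset.")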


Male names: 1162622
Female names: 1537490
There are more female names in the dataset.
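
The figures above are row counts. Depending on how the question is read, total births or distinct names per gender may be more relevant. A sketch of all three readings on a made-up frame with the same column names:

```python
import pandas as pd

# Made-up records; column names follow the Step 4 output.
toy = pd.DataFrame({
    "Name":   ["Mary", "Mary", "Anna", "James"],
    "Year":   [1910, 1911, 1910, 1910],
    "Gender": ["F", "F", "F", "M"],
    "Count":  [14, 12, 10, 40],
})

rows_per_gender = toy["Gender"].value_counts()              # number of records
births_per_gender = toy.groupby("Gender")["Count"].sum()    # total babies
names_per_gender = toy.groupby("Gender")["Name"].nunique()  # distinct names

print(rows_per_gender, births_per_gender, names_per_gender, sep="\n\n")
```

In this toy data the female records outnumber the male ones, yet the single male record accounts for more births, so the three readings can disagree.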


### Step 7. Group the dataset by name and assign to names

In [6]:
# Sums every numeric column per name; the summed Year is not meaningful, only Count is used below
names = baby_names.groupby('Name').sum(numeric_only=True)
print(names.head())


           Year   Count
Name                   
Aadan      6031    18.0
Aadarsh    2009     5.0
Aaden    186937  1743.0
Aadhav     2014     6.0
Aadhya    40254   217.0
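
The `Year` column in this output is a sum of year values (for example 6031 for Aadan), which carries no meaning. One possible alternative, sketched on made-up data, is to select `Count` before aggregating so that only the meaningful column is summed:

```python
import pandas as pd

# Made-up records; column names follow the Step 4 output.
toy = pd.DataFrame({
    "Name":  ["Mary", "Mary", "Anna"],
    "Year":  [1910, 1911, 1910],
    "Count": [14.0, 12.0, 10.0],
})

# Double brackets keep the result a DataFrame with a single 'Count' column.
names_toy = toy.groupby("Name")[["Count"]].sum()
print(names_toy)
```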


### Step 8. How many different names exist in the dataset?

In [7]:
unique_names_count = baby_names['Name'].nunique()
print(f"There are {unique_names_count} different names in the dataset.")


There are 24528 different names in the dataset.


### Step 9. Which name has the most occurrences?

In [8]:
most_common_name = names['Count'].idxmax()
most_common_count = names['Count'].max()
print(f"The name with the most occurrences is '{most_common_name}' with {most_common_count} occurrences.")


The name with the most occurrences is 'James' with 2173368.0 occurrences.
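
The count prints as `2173368.0` because `Count` is a float column. A small sketch, on made-up totals, of formatting the count without the trailing `.0` and of listing several top names with `Series.nlargest`:

```python
import pandas as pd

# Made-up totals for illustration; not taken from the dataset.
counts = pd.Series(
    {"James": 50.0, "Mary": 40.0, "Anna": 10.0, "Lucy": 30.0}, name="Count"
)

top_name = counts.idxmax()
top_count = counts.max()
# The ',.0f' format drops the decimals and adds thousands separators.
message = f"'{top_name}' with {top_count:,.0f} occurrences"
print(message)

# nlargest returns the n largest values in descending order.
top3 = counts.nlargest(3)
print(top3)
```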


### Step 10. How many different names have the least occurrences?

In [9]:
least_occurrence_count = names['Count'].min()
least_common_names = names[names['Count'] == least_occurrence_count]
print(f"Number of names with the least occurrences ({least_occurrence_count}): {len(least_common_names)}")


Number of names with the least occurrences (5.0): 3728


### Step 11. What is the median name occurrence?

In [10]:
median_occurrence = names['Count'].median()
print(f"The median name occurrence is: {median_occurrence}")


The median name occurrence is: 48.0


### Step 12. What is the standard deviation of name occurrences?

In [11]:
std_deviation = names['Count'].std()
print(f"The standard deviation of name occurrences is: {std_deviation}")


The standard deviation of name occurrences is: 47513.43031437322
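
`Series.std()` returns the sample standard deviation by default (`ddof=1`, dividing by n - 1). If the population form is wanted, `ddof=0` can be passed. A sketch on made-up values showing the difference:

```python
import pandas as pd

# Made-up counts: mean 10, squared deviations sum to 150.
counts = pd.Series([5.0, 5.0, 10.0, 20.0])

sample_std = counts.std()            # ddof=1: sqrt(150 / 3)
population_std = counts.std(ddof=0)  # ddof=0: sqrt(150 / 4)

print(sample_std, population_std)
```

With roughly 24,500 names the two values differ only marginally, so the choice matters mostly for small samples.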


### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [12]:
summary = names['Count'].describe()
print("Summary statistics for name occurrences:")
print(summary)


Summary statistics for name occurrences:
count    2.452800e+04
mean     5.704112e+03
std      4.751343e+04
min      5.000000e+00
25%      1.000000e+01
50%      4.800000e+01
75%      3.720000e+02
max      2.173368e+06
Name: Count, dtype: float64
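
The mean (about 5,704) sits far above the median (48), which suggests a strongly right-skewed distribution dominated by a few very common names. A sketch, on made-up values, of asking `describe()` for upper percentiles to look at that tail more closely:

```python
import pandas as pd

# Made-up, deliberately skewed counts for illustration.
counts = pd.Series([5.0, 5.0, 10.0, 48.0, 372.0, 100000.0], name="Count")

# The percentiles argument replaces the default 25%/50%/75% set;
# the median (50%) is always included.
summary = counts.describe(percentiles=[0.5, 0.9, 0.99])
print(summary)
```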
