# Analyzing a simple dataset

In this notebook we will do the following:

1. Read a dataset from the S3 bucket that we created as part of the lab. The dataset is StateNames.csv, which should already be in your S3 bucket (s3://lab2-your-gu-netid) at this point.

2. Find the total number of occurrences of each _Name_. Pay attention to the _Count_ column.

3. Find the most frequently occurring name.

4. Plot a timeseries chart for the most frequently occurring name.

5. What insight does the timeseries chart reveal?

In [34]:
import matplotlib.pyplot as plt
import pandas as pd
import os

## 1. Read the data

Pandas can read the data directly from S3.

In [35]:
# set the bucket name and file name in separate variables so that
# we can construct the S3 URI without completely hardcoding it

bucket_name = "lab2-aa1603" # REPLACE THIS with your own bucket's name
dataset_name = "StateNames.csv"
# build the URI with an f-string; os.path.join would insert backslashes on Windows
dataset_path_in_s3 = f"s3://{bucket_name}/{dataset_name}"

# read the dataset into a Pandas dataframe
df = pd.read_csv(dataset_path_in_s3)
print(f"read the dataset from {dataset_path_in_s3} into a dataframe, shape of the dataframe is {df.shape}")

# a random sample of the dataframe
display(df.sample(10))

read the dataset from s3://lab2-aa1603/StateNames.csv into a dataframe, shape of the dataframe is (5647426, 6)


| | Id | Name | Year | Gender | State | Count |
|---|---|---|---|---|---|---|
| 3255411 | 3255412 | Jenna | 1988 | F | NE | 53 |
| 1108547 | 1108548 | Johnnie | 1979 | M | FL | 42 |
| 1238419 | 1238420 | Sherlin | 2007 | F | GA | 5 |
| 739219 | 739220 | Darcy | 1984 | F | CO | 10 |
| 3049377 | 3049378 | Valinda | 1951 | F | NC | 6 |
| 807203 | 807204 | Brody | 2012 | M | CO | 85 |
| 1195917 | 1195918 | Sonji | 1970 | F | GA | 5 |
| 4659690 | 4659691 | Tracee | 1976 | F | TN | 6 |
| 1564608 | 1564609 | Tina | 1965 | F | IL | 727 |
| 1379859 | 1379860 | Muriel | 1913 | F | IA | 16 |


## 2. Number of occurrences of each name over the years

The _Count_ field gives the number of times a name occurs in a given state in a given year. To get the total number of occurrences of a name over the years, we sum the _Count_ field for each name across all states and years. We sort the output in descending order so that the most frequent name is at the top of the list. The output of the _groupby_ and _sum_ is a _Pandas Series_ indexed by _Name_.
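
The aggregation described above can be sketched as follows. This is a minimal illustration on a small hand-made dataframe rather than the full dataset; the rows, `toy_df`, and the variable name `name_counts` are assumptions made for the example, not values from the notebook.

```python
import pandas as pd

# Toy stand-in for the StateNames data; these rows are made up for illustration.
toy_df = pd.DataFrame({
    "Name":  ["James", "James", "Mary", "Mary", "Tina"],
    "Year":  [1950, 1951, 1950, 1951, 1965],
    "State": ["NY", "CA", "NY", "CA", "IL"],
    "Count": [100, 120, 90, 80, 30],
})

# Sum Count per name across all states and years, most frequent name first.
name_counts = (
    toy_df.groupby("Name")["Count"]
    .sum()
    .sort_values(ascending=False)
)
print(name_counts)
```

On the real data the same chain would presumably be applied to `df` instead of `toy_df`.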

## 3. Most common name

The most common name is at the top of the series. The _index_ of the series holds the values of the column on which the groupby was done (_Name_). Because the series is sorted in descending order, the first value in the index corresponds to the most frequently occurring name.

 the most frequently occuring name in the dataset is "James"
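
The index lookup described above might look like the sketch below. The series values are made up for illustration, and `name_counts` and `most_frequent_name` are hypothetical variable names.

```python
import pandas as pd

# Hypothetical series of total counts per name (values are made up),
# sorted in descending order as described in the text.
name_counts = pd.Series(
    {"Mary": 170, "James": 220, "Tina": 30}, name="Count"
).sort_values(ascending=False)

# The series is sorted, so the first index label is the most frequent name.
most_frequent_name = name_counts.index[0]

# idxmax() returns the same label without relying on the sort order.
assert most_frequent_name == name_counts.idxmax()

print(f'the most frequently occurring name is "{most_frequent_name}"')
```

Using `idxmax()` alone would also work on an unsorted series, which can be a safer choice if the sort step is ever removed.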


## 4. Create a timeseries just for the most frequent name

This is done by filtering the original dataframe to include only the rows in which the _Name_ field matches the most frequent name, and then summing the _Count_ for each year across all states.
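
A minimal sketch of the filter-then-aggregate step, again on made-up rows. `toy_df` and `name_by_year` are illustrative names, and `most_frequent_name` is hardcoded here only to keep the example self-contained.

```python
import pandas as pd

# Toy rows (made up): one name can appear in several states in the same year.
toy_df = pd.DataFrame({
    "Name":  ["James", "James", "James", "Mary"],
    "Year":  [1950, 1950, 1951, 1950],
    "State": ["NY", "CA", "NY", "NY"],
    "Count": [100, 50, 120, 90],
})
most_frequent_name = "James"  # assumed to come from the previous step

# Keep only the rows for the chosen name, then sum Count per year across states.
name_by_year = (
    toy_df[toy_df["Name"] == most_frequent_name]
    .groupby("Year")["Count"]
    .sum()
)
print(name_by_year)
```

The result is a series indexed by _Year_, which is a convenient shape for plotting.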

Create a simple plot using matplotlib and also save the figure.
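
One way the plot and the saved figure could be produced is sketched below. The yearly values, the axis labels, and the output file name `most_frequent_name_timeseries.png` are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical yearly totals for one name (values are made up).
yearly = pd.Series(
    [150, 120, 135],
    index=pd.Index([1950, 1951, 1952], name="Year"),
    name="Count",
)

# Draw a simple line chart of count against year.
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(yearly.index, yearly.values, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Count")
ax.set_title("Yearly count for the most frequent name (toy data)")
fig.tight_layout()

# Save the figure to disk, then release it.
output_file = "most_frequent_name_timeseries.png"  # assumed file name
fig.savefig(output_file)
plt.close(fig)
```

Calling `savefig` before the figure is closed matters: once the figure is closed there is nothing left to write.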

## 5. The Insight
