# An Analysis of Zodiac and Excellence in Career Paths

In this project, I read several datasets into pandas dataframes, clean them, load them into a SQLite database, combine them with a SQL join, and then analyze the results.

I began by reading each dataset from my local machine into a pandas dataframe, starting with the birthdate file, courtesy of fellow GitHub user richard512 (https://github.com/richard512/Little-Big-Data/blob/master/famous-birthdates.csv).

To run this code on your own, replace the file path below with the path to the data file in your local clone of the repo.
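
As a minimal sketch of one way to avoid editing a hard-coded path, the location can be built from the folder where the repo was cloned. The helper name `birthdate_file` and the `repo_root` argument are hypothetical; the `Data/famous-birthdates.txt` tail is assumed from the path used in the cell below.

```python
from pathlib import Path


def birthdate_file(repo_root):
    """Return the assumed location of the birthdate file inside a cloned repo."""
    # 'Data/famous-birthdates.txt' mirrors the tail of the path used in this notebook
    return Path(repo_root) / "Data" / "famous-birthdates.txt"


# Hypothetical usage: pass the folder where you cloned the project
example_path = birthdate_file("Capstone_Project_Repo")
print(example_path.as_posix())
```

The resulting `Path` object can be passed straight to `pd.read_csv` in place of the raw string.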


In [1]:
import pandas as pd
import numpy 

#Read in raw data file into pandas df
bd_df = pd.read_csv(r'C:\Users\Jordan\Documents\CodeKY\Capstone_Project_Repo\Data\famous-birthdates.txt', delimiter = " ")
bd_df.head(10)

Unnamed: 0,name,lastname,firstname,articleNum,birthDate,birthMonth,birthDay,zodiac
1,Aaliyah,Aaliyah,,0,1979-01-16,1.0,16.0,Capricorn
2,"Aaron, Hank",Aaron,Hank,46,1934-02-05,2.0,5.0,Aquarius
3,"Abacha, Sani",Abacha,Sani,2,1943-09-20,9.0,20.0,Virgo
4,"Abbado, Claudio",Abbado,Claudio,9,1933-06-26,6.0,26.0,Cancer
5,"Abbas, Mahmoud",Abbas,Mahmoud,306,1935-03-26,3.0,26.0,Aries
6,"Abdel Rahman, Omar",Abdel Rahman,Omar,21,1938-05-03,5.0,3.0,Taurus
7,"Abdul-Jabbar, Kareem",Abdul-Jabbar,Kareem,11,1947-04-16,4.0,16.0,Aries
8,"Abdul-Rauf, Mahmoud",Abdul-Rauf,Mahmoud,0,1969-03-09,3.0,9.0,Pisces
9,"Abdullah II, King of Jordan",Abdullah II,King of Jordan,1,1962-01-30,1.0,30.0,Aquarius
10,"Abdullah, Abdullah",Abdullah,Abdullah,29,1960-01-01,1.0,1.0,Capricorn


## Cleaning Birthdate Data Set
### Working with Birthdates

Since this dataset is the main source of birthdates for my project, I am only interested in keeping rows with a birthdate, so I first dropped any rows where it was missing. I then converted the `birthDate` column to datetime and dropped any rows where the conversion failed. Given that birthdates are static, I assigned each a Date ID based on the day of the year to simplify this dataset and better prepare it for a SQL table.

Before removing the birthdate altogether, I extracted the year. This year is combined later with a portion of the person's first and last name to create a unique ID on which to join my datasets.

In [2]:
#number of rows before any cleaning
len(bd_df)

4710

In [3]:
#drop rows without a birthDate
bd_df.dropna(subset = ['birthDate'], inplace=True)
len(bd_df)

4491

In [4]:
#convert birthdate to datetime data type (unparseable values become NaT)
bd_df['birthDate'] = pd.to_datetime(bd_df['birthDate'], errors='coerce')
#then convert to a day of the year
bd_df['Date_Id'] = bd_df['birthDate'].dt.dayofyear
#drop rows where the date_id is NULL 
bd_df.dropna(subset = ['Date_Id'], inplace=True)
#extract birthyear from birthdate
bd_df['year'] = bd_df['birthDate'].dt.year
#drop unnecessary columns
bd_df = bd_df.drop(columns=['articleNum', 'birthDate', 'birthMonth', 'birthDay', 'zodiac'])
bd_df.head(10)


Unnamed: 0,name,lastname,firstname,Date_Id,year
1,Aaliyah,Aaliyah,,16.0,1979
2,"Aaron, Hank",Aaron,Hank,36.0,1934
3,"Abacha, Sani",Abacha,Sani,263.0,1943
4,"Abbado, Claudio",Abbado,Claudio,177.0,1933
5,"Abbas, Mahmoud",Abbas,Mahmoud,85.0,1935
6,"Abdel Rahman, Omar",Abdel Rahman,Omar,123.0,1938
7,"Abdul-Jabbar, Kareem",Abdul-Jabbar,Kareem,106.0,1947
8,"Abdul-Rauf, Mahmoud",Abdul-Rauf,Mahmoud,68.0,1969
9,"Abdullah II, King of Jordan",Abdullah II,King of Jordan,30.0,1962
10,"Abdullah, Abdullah",Abdullah,Abdullah,1.0,1960
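
One detail of a day-of-year ID is worth illustrating: `dt.dayofyear` counts from January 1 of each person's own birth year, so the same calendar date lands one number higher in a leap year for any date after February. The sketch below uses toy dates (not rows from the dataset) to show this, along with one possible year-independent alternative, `month * 100 + day`. The `MonthDay` column is an assumption for illustration only, and whether the shift matters depends on how the Date ID is used later.

```python
import pandas as pd

# Toy birthdates (not from the dataset): the same calendar day in a
# non-leap year and a leap year, plus one unparseable value
dates = pd.to_datetime(
    pd.Series(["1935-03-26", "1936-03-26", "not a date"]), errors="coerce"
)

# dayofyear shifts by one after February in leap years
day_of_year = dates.dt.dayofyear

# Assumed alternative: month * 100 + day is the same for every year (e.g. 326)
month_day = dates.dt.month * 100 + dates.dt.day

toy = pd.DataFrame(
    {"birthDate": dates, "Date_Id": day_of_year, "MonthDay": month_day}
)
# Rows that failed to parse were coerced to NaT and are dropped here
toy = toy.dropna(subset=["birthDate"])
print(toy)
```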


## Cleaning Names

Luckily, this dataset separated out first and last name, but I took a few steps to further standardize the formatting of names.

1) In instances where a person goes by only one name, I moved that name from lastname to firstname, and set the lastname to NULL.
2) There are a handful of instances where the value in lastname is not exactly a last name (e.g. Abdullah II, King of Jordan). I replaced those manually.
3) In preparation for my next step, I removed any spaces from first and last names.
4) Finally, any instance where a lastname is NULL was set to be blank.

In [5]:
#fixing instances where the person goes by a single name
#copy 'lastname' to 'firstname' where 'firstname' is null
one_name = bd_df['firstname'].isnull()
bd_df.loc[one_name, 'firstname'] = bd_df.loc[one_name, 'lastname']
#set 'lastname' to null for the rows where 'firstname' was null
bd_df.loc[one_name, 'lastname'] = pd.NA
#fix instances where last name is e.g. King of Jordan
bd_df.at[9, 'lastname'] = ''
bd_df.at[9, 'firstname'] = 'Abdullah II, King of Jordan'
bd_df.at[1212, 'lastname'] = ''
bd_df.at[1212, 'firstname'] = 'Elizabeth II, Queen of Great Britain'
bd_df.at[3036, 'lastname'] = ''
bd_df.at[3036, 'firstname'] = 'Nicholas II, Czar of Russia'
#Remove Spaces in first and lastname
bd_df['firstname'] = bd_df['firstname'].str.replace(' ', '')
bd_df['lastname'] = bd_df['lastname'].str.replace(' ', '')
#fill NA with blanks
bd_df = bd_df.fillna('')
bd_df.head(20)


Unnamed: 0,name,lastname,firstname,Date_Id,year
1,Aaliyah,,Aaliyah,16.0,1979
2,"Aaron, Hank",Aaron,Hank,36.0,1934
3,"Abacha, Sani",Abacha,Sani,263.0,1943
4,"Abbado, Claudio",Abbado,Claudio,177.0,1933
5,"Abbas, Mahmoud",Abbas,Mahmoud,85.0,1935
6,"Abdel Rahman, Omar",AbdelRahman,Omar,123.0,1938
7,"Abdul-Jabbar, Kareem",Abdul-Jabbar,Kareem,106.0,1947
8,"Abdul-Rauf, Mahmoud",Abdul-Rauf,Mahmoud,68.0,1969
9,"Abdullah II, King of Jordan",,"AbdullahII,KingofJordan",30.0,1962
10,"Abdullah, Abdullah",Abdullah,Abdullah,1.0,1960
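
The four steps can also be tried on a small toy frame. This is only a sketch under stated assumptions: the rows are hand-typed to resemble the dataset, and step 2 matches on the original `name` value rather than on an index label, which is a swapped-in variation and not the approach used in the cell above.

```python
import pandas as pd

# Toy rows shaped like the dataset (index labels here are arbitrary)
names = pd.DataFrame(
    {
        "name": ["Aaliyah", "Abdel Rahman, Omar", "Abdullah II, King of Jordan"],
        "lastname": ["Aaliyah", "Abdel Rahman", "Abdullah II"],
        "firstname": [None, "Omar", "King of Jordan"],
    }
)

# Step 1: single-name people, move lastname into firstname
one_name = names["firstname"].isnull()
names.loc[one_name, "firstname"] = names.loc[one_name, "lastname"]
names.loc[one_name, "lastname"] = None

# Step 2: titled names, matched on the 'name' value instead of an index label
titled = names["name"] == "Abdullah II, King of Jordan"
names.loc[titled, "firstname"] = names.loc[titled, "name"]
names.loc[titled, "lastname"] = ""

# Steps 3 and 4: strip spaces, then replace missing values with blanks
for col in ["firstname", "lastname"]:
    names[col] = names[col].str.replace(" ", "", regex=False).fillna("")
print(names)
```

Matching on a value keeps the fix valid even if rows are reordered or re-indexed, at the cost of needing the exact string.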


## Creating Joinable Value

I also needed a way to join my two sets of names. Since they don't follow the same naming conventions, I created a new column that concatenates the first 3 letters of a person's first name, the first 3 letters of their last name, and their birth year into a (reasonably) unique value. There were a few duplicates, which I handled manually for now.

Finally, I dropped the unneeded columns and cleaned up headings before loading this into a SQLite database.  

In [6]:
#get the first 3 letters of the first and last name
#cast year to a string and concatenate to create a value to join with other dataset
bd_df['First3'] = bd_df['firstname'].str[:3]
bd_df['Last3'] = bd_df['lastname'].str[:3]
bd_df['year'] = bd_df['year'].astype(str)
bd_df['People_Lookup'] = bd_df['First3'] + bd_df['Last3'] + bd_df['year']
#fix some duplicate people lookups
bd_df.at[264, 'People_Lookup'] = 'KatBatt1948'
bd_df.at[2410, 'People_Lookup'] = 'LeeK1923'
bd_df.at[2485, 'People_Lookup'] = 'LiK1928'
#clean-up headings, drop added columns
bd_df = bd_df.rename(columns={'name': 'ImportName', 'lastname': 'LastName','firstname': 'FirstName', 'Date_Id': 'DateID'})
bd_df = bd_df.drop(columns=['year', 'First3', 'Last3'])
bd_df.head(10)


Unnamed: 0,ImportName,LastName,FirstName,DateID,People_Lookup
1,Aaliyah,,Aaliyah,16.0,Aal1979
2,"Aaron, Hank",Aaron,Hank,36.0,HanAar1934
3,"Abacha, Sani",Abacha,Sani,263.0,SanAba1943
4,"Abbado, Claudio",Abbado,Claudio,177.0,ClaAbb1933
5,"Abbas, Mahmoud",Abbas,Mahmoud,85.0,MahAbb1935
6,"Abdel Rahman, Omar",AbdelRahman,Omar,123.0,OmaAbd1938
7,"Abdul-Jabbar, Kareem",Abdul-Jabbar,Kareem,106.0,KarAbd1947
8,"Abdul-Rauf, Mahmoud",Abdul-Rauf,Mahmoud,68.0,MahAbd1969
9,"Abdullah II, King of Jordan",,"AbdullahII,KingofJordan",30.0,Abd1962
10,"Abdullah, Abdullah",Abdullah,Abdullah,1.0,AbdAbd1960
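
Because the lookup value is only reasonably unique, it helps to list collisions before fixing them by hand. The sketch below builds the same kind of key on toy rows and flags duplicates with `duplicated(keep=False)`. The second and third people are invented purely to force a collision.

```python
import pandas as pd

# Toy rows; the second and third are invented to force a key collision
people = pd.DataFrame(
    {
        "firstname": ["Hank", "Katrina", "Katya", "Aaliyah"],
        "lastname": ["Aaron", "Batson", "Batista", ""],
        "year": [1934, 1948, 1948, 1979],
    }
)

# First 3 letters of first name + first 3 letters of last name + birth year
people["People_Lookup"] = (
    people["firstname"].str[:3]
    + people["lastname"].str[:3]
    + people["year"].astype(str)
)

# keep=False flags every row that shares its key with another row
collisions = people[people["People_Lookup"].duplicated(keep=False)]
print(collisions)
```

Any rows that show up in `collisions` are candidates for a manual override like the ones in the cell above.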


In [7]:
#write dataframe to table
import sqlite3
connection = sqlite3.connect('Zodiac_Analysis.db')
bd_df.to_sql('Famous_People_Import', connection, if_exists='replace')

4477
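
A quick way to confirm a write like this is to read the row count back from the table. The sketch below is a minimal round trip on a toy frame, using an in-memory database so it leaves no file behind. It also passes `index=False`, which differs from the cell above: with the default `index=True`, pandas writes the dataframe index as an extra column.

```python
import sqlite3

import pandas as pd

# Toy frame standing in for bd_df; values copied from the preview above
toy = pd.DataFrame(
    {
        "FirstName": ["Hank", "Sani"],
        "LastName": ["Aaron", "Abacha"],
        "DateID": [36.0, 263.0],
        "People_Lookup": ["HanAar1934", "SanAba1943"],
    }
)

# In-memory database so the sketch does not touch Zodiac_Analysis.db
connection = sqlite3.connect(":memory:")
toy.to_sql("Famous_People_Import", connection, if_exists="replace", index=False)

# Read the count back to confirm every row was written
row_count = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM Famous_People_Import", connection
)["n"].iloc[0]
print(row_count)
connection.close()
```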