# Feature Engineering

In [2]:
# Data Mining
import pandas as pd

# Model Building
import statsmodels.api as sm

In [3]:
df= pd.read_csv("Heads of State.csv")
df.head(5)

Unnamed: 0,Name,Wikipedia Page,Description,Image,Birth Year,Age Term Began (approx.),Term Began,Year Term Began,Term Ended,Year Term Ended,...,Royal?,Current?,Birth Place (current city),Country of Birth (current country),Ruler of,Country of Ruled Territory (Current),Political Party,Studies,Role,Religion
0,"Heinrich II, Hoya",http://de.wikipedia.org/wiki/Heinrich_II._(Hoya),,,,,1235,1235,1290,1290,...,Yes,No,,,Hoya,Germany,,,Count,
1,Pandukabhaya of Sri Lanka,http://en.wikipedia.org/wiki/Pandukabhaya_of_S...,Pandukabhaya (437 BC – 367 BC) was King of Upa...,,,,-437,-437,-367,-255,...,Yes,No,,,Anuradhapura,Sri Lanka,,,King,
2,Arechis I of Benevento,http://en.wikipedia.org/wiki/Arechis_I_of_Bene...,"Arechis I (also Arigis, Aretchis, Italian: Are...",,,,591,591,641,641,...,Yes,No,,,Benevento,Italy,,,Duke,
3,Vikramaditya VI,http://en.wikipedia.org/wiki/Vikramaditya_VI,Vikramaditya VI (1076 – 1126 CE) became the We...,http://commons.wikimedia.org/wiki/Special:File...,,,1076,1076,1126,1126,...,Yes,No,,,Western Chalukya Empire,India,,,King,
4,Ponhea Yat,http://en.wikipedia.org/wiki/Ponhea_Yat,"Preah Ponhea Yat (Khmer: ព្រះបាទពញាយ៉ាត, 1393–...",,,,1393,1393,1463,1463,...,Yes,No,,,Khmer Empire,Cambodia,,,King,


# Exploratory Data Analysis

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 454 entries, 0 to 453
Data columns (total 22 columns):
Name                                    454 non-null object
Wikipedia Page                          336 non-null object
Description                             331 non-null object
Image                                   223 non-null object
Birth Year                              296 non-null object
Age Term Began (approx.)                295 non-null float64
Term Began                              454 non-null object
Year Term Began                         454 non-null object
Term Ended                              361 non-null object
Year Term Ended                         361 non-null object
Term length                             454 non-null object
Days in Term                            454 non-null object
Royal?                                  454 non-null object
Current?                                454 non-null object
Birth Place (current city)              189 non-null o

### _Notable Findings_
- Several columns are to be dropped because they are not relevant for predictive modeling:
    - Wikipedia Page
    - Description
    - Image
- Several columns are to be dropped because they are redundant with, or closely correlated with, information held in other columns:
    - Birth Year
    - Term Began
    - Year Term Began
    - Term Ended
    - Year Term Ended
    - Term length
    - Birth Place (current city)
    - Country of Birth (current country)
    - Ruler of
    - Country of Ruled Territory (Current)
    - Political Party
    - Studies
    - Role
- Data will need to be subset to include only people who are no longer heads of state. (For current heads of state the reign is still ongoing, so its final length is not yet known.)
    - Current?
- The target will need a datatype conversion (from object to float) and will need to be expressed in years. (For example, a reign of 330 days should come out at approximately 0.9.)
    - Days in Term
- Categorical columns will need to be transformed into dummy variables before they can be used as predictor variables:
    - Royal?
    - Religion
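
The target transformation described in the list above can be sketched in plain Python. The helper name `days_to_years` and the sample strings are assumptions for illustration only; they are not taken from the dataset.

```python
import re

def days_to_years(raw, days_per_year=365):
    """Strip non-digit characters from a raw day count and convert it to years.

    Hypothetical helper: returns None when the string contains no digits.
    """
    digits = re.sub(r"[^0-9]", "", raw)
    if not digits:
        return None  # nothing numeric to convert
    return float(digits) / days_per_year

print(round(days_to_years("330"), 2))     # 0.9
print(round(days_to_years("20,088"), 2))  # 55.04
```

Note that stripping every non-digit character would also remove decimal points and minus signs, so this approach is only reasonable if the column holds non-negative whole numbers of days.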

In [5]:
# Dropping columns (each label listed once)
df.drop(['Wikipedia Page','Description','Image','Birth Year',
         'Term Began','Year Term Began','Term Ended','Year Term Ended','Term length',
         'Birth Place (current city)','Country of Birth (current country)','Ruler of',
         'Country of Ruled Territory (Current)','Political Party','Studies','Role'],axis=1,inplace=True)

# Inspecting
df.head(5)

Unnamed: 0,Name,Age Term Began (approx.),Days in Term,Royal?,Current?,Religion
0,"Heinrich II, Hoya",,20088,Yes,No,
1,Pandukabhaya of Sri Lanka,,25567,Yes,No,
2,Arechis I of Benevento,,18262,Yes,No,
3,Vikramaditya VI,,18262,Yes,No,
4,Ponhea Yat,,25567,Yes,No,


In [6]:
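# Subsetting DataFrame to include only those who are not currently heads of state
# .copy() makes the subset independent, so later column assignments do not raise SettingWithCopyWarning
df = df[df['Current?'] == 'No'].copy()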

# Inspecting 
df['Current?'].value_counts()

No    363
Name: Current?, dtype: int64

In [7]:
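# Converting 'Days in Term' datatype
# First removing non-numeric characters with a regular expression
# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df['Days in Term'] = df['Days in Term'].str.replace(r"[^0-9]", '', regex=True)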

# Now converting from object to float 
df['Days in Term'] = df['Days in Term'].astype(float)

# Inspecting
df['Days in Term'].dtype

dtype('float64')

In [8]:
# Expressing 'Length of Reign' in terms of years
# Creating new DataFrame column
df['Length of Reign'] = df['Days in Term']/365

# Inspecting
df[['Days in Term','Length of Reign']].head(10)

Unnamed: 0,Days in Term,Length of Reign
0,20088.0,55.035616
1,25567.0,70.046575
2,18262.0,50.032877
3,18262.0,50.032877
4,25567.0,70.046575
5,26525.0,72.671233
6,19723.0,54.035616
7,30204.0,82.750685
8,21915.0,60.041096
9,18627.0,51.032877


## _Does being royal have a significant effect on the length of one's reign?_

In [9]:
# Creating Royal Dummy Variable
# dtype=int keeps the dummy numeric (recent pandas versions return booleans by default)
royal_dummy = pd.get_dummies(df['Royal?'],drop_first=True,prefix='Royal',dtype=int)

# Inspecting
royal_dummy.head(5)

Unnamed: 0,Royal_Yes
0,1
1,1
2,1
3,1
4,1


In [10]:
# Adding Royal Dummy to DataFrame
df = pd.concat([df,royal_dummy],axis=1)

# Inspecting
df.head(5)

Unnamed: 0,Name,Age Term Began (approx.),Days in Term,Royal?,Current?,Religion,Length of Reign,Royal_Yes
0,"Heinrich II, Hoya",,20088.0,Yes,No,,55.035616,1
1,Pandukabhaya of Sri Lanka,,25567.0,Yes,No,,70.046575,1
2,Arechis I of Benevento,,18262.0,Yes,No,,50.032877,1
3,Vikramaditya VI,,18262.0,Yes,No,,50.032877,1
4,Ponhea Yat,,25567.0,Yes,No,,70.046575,1


In [11]:
# Building OLS model 

x = df['Royal_Yes']
y = df['Length of Reign']
x = sm.add_constant(x)
model = sm.OLS(y,x)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,Length of Reign,R-squared:,0.37
Model:,OLS,Adj. R-squared:,0.368
Method:,Least Squares,F-statistic:,211.8
Date:,"Thu, 02 Aug 2018",Prob (F-statistic):,4.51e-38
Time:,20:41:51,Log-Likelihood:,-1251.6
No. Observations:,363,AIC:,2507.0
Df Residuals:,361,BIC:,2515.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,35.5415,1.525,23.299,0.000,32.542,38.541
Royal_Yes,23.0040,1.581,14.552,0.000,19.895,26.113

0,1,2,3
Omnibus:,89.284,Durbin-Watson:,1.795
Prob(Omnibus):,0.0,Jarque-Bera (JB):,172.008
Skew:,1.326,Prob(JB):,4.46e-38
Kurtosis:,5.082,Cond. No.,7.49


### _Interpreting OLS Results_
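- In this sample, being listed as royal is associated with a reign about 23 years longer on average (roughly 58.5 years versus 35.5 years for non-royals), and the difference is statistically significant (p < 0.001). This is an association in the data, not evidence that royal status causes longer reigns.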
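
With a single 0/1 predictor, the OLS intercept equals the mean of the 0 group and the slope equals the difference between the two group means. The sketch below checks this on a few made-up reign lengths; the function name `simple_ols` and the numbers are assumptions for illustration, not values from the dataset.

```python
def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Made-up reign lengths in years (illustrative only)
royal = [1, 1, 1, 0, 0]
reign = [60.0, 50.0, 70.0, 30.0, 40.0]

intercept, slope = simple_ols(royal, reign)
mean_non_royal = sum(r for d, r in zip(royal, reign) if d == 0) / royal.count(0)
mean_royal = sum(r for d, r in zip(royal, reign) if d == 1) / royal.count(1)

# The two values on each line should agree up to floating-point rounding
print(intercept, mean_non_royal)
print(slope, mean_royal - mean_non_royal)
```

Read this way, the `const` of 35.54 in the summary is the average reign of non-royals and the `Royal_Yes` coefficient of 23.00 is the gap between the two group averages.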


## _Does having a religion listed have a significant effect on the length of one's reign?_

In [12]:
# Flagging missing values: True means no religion is listed
df['Religion'] = df['Religion'].isnull()
df['Religion'].head(5)

0    True
1    True
2    True
3    True
4    True
Name: Religion, dtype: bool

In [13]:
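# Encoding column as a 0/1 indicator (1 = religion listed, 0 = no religion listed)
# Inverting the missing-value flag directly avoids assigning a two-column dummy frame to one column
df['Religion'] = (~df['Religion']).astype('uint8')
df['Religion'].head(5)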

0    0
1    0
2    0
3    0
4    0
Name: Religion, dtype: uint8

In [14]:
# Building OLS model 

x = df['Religion']
y = df['Length of Reign']
x = sm.add_constant(x)
model = sm.OLS(y,x)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,Length of Reign,R-squared:,0.055
Model:,OLS,Adj. R-squared:,0.052
Method:,Least Squares,F-statistic:,20.94
Date:,"Thu, 02 Aug 2018",Prob (F-statistic):,6.54e-06
Time:,20:41:56,Log-Likelihood:,-1325.1
No. Observations:,363,AIC:,2654.0
Df Residuals:,361,BIC:,2662.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,57.5842,0.509,113.179,0.000,56.584,58.585
Religion,-8.6986,1.901,-4.576,0.000,-12.437,-4.960

0,1,2,3
Omnibus:,24.562,Durbin-Watson:,1.343
Prob(Omnibus):,0.0,Jarque-Bera (JB):,52.363
Skew:,0.353,Prob(JB):,4.26e-12
Kurtosis:,4.721,Cond. No.,3.9


### _Interpreting OLS Results_
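- Having a religion listed is associated with a reign about 8.7 years shorter on average. The difference is statistically significant (p < 0.001), although the model explains little of the variance (R-squared of 0.055).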
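
The `t` column in the summary is the coefficient divided by its standard error. A quick arithmetic check using the Religion row reported above is sketched here; the 1.96 multiplier is a large-sample normal approximation chosen for illustration, so the interval it gives differs slightly from the t-based interval that statsmodels reports.

```python
coef = -8.6986   # Religion coefficient from the summary above
std_err = 1.901  # its reported standard error

# t statistic: how many standard errors the estimate lies from zero
t_stat = coef / std_err

# Approximate 95% interval using the normal critical value (an assumption)
ci_low = coef - 1.96 * std_err
ci_high = coef + 1.96 * std_err

print(round(t_stat, 3))  # close to the reported t of -4.576
print(round(ci_low, 2), round(ci_high, 2))
```

Because the whole interval lies below zero, the sign of the estimated difference is unlikely to be an artifact of sampling noise in this simple model.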

## _Build a multiple linear regression (MLR) model to predict length of reign_

In [18]:
# Dropping rows with any missing values before fitting the multi-predictor model
df.dropna(inplace=True)

In [19]:
# Building OLS model 

x = df[['Age Term Began (approx.)','Religion','Royal_Yes']]
y = df['Length of Reign']
x = sm.add_constant(x)
model = sm.OLS(y,x)
results = model.fit()
results.summary()

0,1,2,3
Dep. Variable:,Length of Reign,R-squared:,0.56
Model:,OLS,Adj. R-squared:,0.553
Method:,Least Squares,F-statistic:,84.7
Date:,"Thu, 02 Aug 2018",Prob (F-statistic):,2.08e-35
Time:,20:42:54,Log-Likelihood:,-697.04
No. Observations:,204,AIC:,1402.0
Df Residuals:,200,BIC:,1415.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,45.1236,2.613,17.270,0.000,39.971,50.276
Age Term Began (approx.),-0.2467,0.047,-5.206,0.000,-0.340,-0.153
Religion,1.4753,1.830,0.806,0.421,-2.134,5.084
Royal_Yes,17.7406,2.187,8.111,0.000,13.428,22.054

0,1,2,3
Omnibus:,56.088,Durbin-Watson:,1.656
Prob(Omnibus):,0.0,Jarque-Bera (JB):,116.612
Skew:,1.302,Prob(JB):,4.77e-26
Kurtosis:,5.634,Cond. No.,149.0


### _Interpreting OLS Results_
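- The intercept (about 45 years) is the predicted length of reign for a non-royal ruler with no religion listed and an age of 0 when the term began, so it is an extrapolation rather than a meaningful estimate on its own
- Holding the other predictors constant, each additional year of age when the term began is associated with a reign about 0.25 years shorter
- Having a religion listed is associated with a reign about 1.5 years longer, but this coefficient is not statistically significant (p = 0.421, and the 95% confidence interval includes zero)
- Holding the other predictors constant, royal status is associated with a reign almost 18 years longer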
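
To see how the coefficients combine, the fitted equation can be evaluated by hand. The sketch below copies the coefficients from the summary above; the function name `predict_reign` and the example inputs are assumptions for illustration, and the result is only a point estimate from a model that explains a little over half of the variance.

```python
# Coefficients copied from the MLR summary above
COEFS = {"const": 45.1236, "age": -0.2467, "religion": 1.4753, "royal": 17.7406}

def predict_reign(age_term_began, religion_listed, royal):
    """Predicted length of reign in years for one hypothetical ruler."""
    return (COEFS["const"]
            + COEFS["age"] * age_term_began
            + COEFS["religion"] * int(religion_listed)
            + COEFS["royal"] * int(royal))

# A royal who began ruling at 30 with no religion listed
print(round(predict_reign(30, False, True), 2))  # 55.46

# A non-royal who began ruling at 50 with a religion listed
print(round(predict_reign(50, True, False), 2))  # 34.26
```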