# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Lab 3.03 | Feature Engineering Lab

In this lab, you'll implement feature engineering on the "Heads of State" data.

Your $Y$ value should be the length of time (in years) each individual reigned.

In [202]:
import pandas as pd

state = pd.read_csv("Heads of State.csv")

In [203]:
state.describe()

Unnamed: 0,Age Term Began (approx.)
count,295.0
mean,25.949153
std,18.961493
min,0.0
25%,9.5
50%,23.0
75%,41.0
max,90.0


In [204]:
state['Term Ended'].sort_values().plot(kind='box')


TypeError: Empty 'DataFrame': no numeric data to plot

Exercise 1: As a first step, do some EDA and data cleaning. Don't go too far down the rabbit hole, but be able to identify potential pitfalls in the data!

In [None]:
num_cols = ['Birth Year','Age Term Began (approx.)', 'Term Began', 'Year Term Began','Term Ended', 'Year Term Ended', 'Term length', 'Days in Term']
state[num_cols] = state[num_cols].apply(pd.to_numeric, errors='coerce')
mask = state['Term Ended'] > 4000
state = state.drop(state[mask].index)
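
A minimal sketch of one way to see what the coercion step does, using a small made-up frame (the values below are hypothetical and are not taken from "Heads of State.csv"). Comparing missing-value counts before and after `pd.to_numeric(..., errors='coerce')` shows how many entries could not be parsed as numbers and were silently turned into `NaN`.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as strings, plus one unparseable entry.
toy = pd.DataFrame({
    "Year Term Ended": ["1990", "2005", "present", None],
    "Days in Term": ["1461", "2920", "365", "730"],
})

missing_before = toy.isna().sum()
coerced = toy.apply(pd.to_numeric, errors="coerce")
missing_after = coerced.isna().sum()

# Entries that were present but not numeric become NaN after coercion.
newly_missing = missing_after - missing_before
print(newly_missing)
```

A non-zero count for a column is a hint that it holds text such as "present" that may deserve a closer look before it is dropped.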

Exercise 2: Create the length of time in years each individual reigned. Include decimal values. (For example, if someone reigned for 330 days, we would expect this value to be approximately 0.9.)

In [None]:
state['Term length'] = state['Days in Term'] / 365
state.describe()
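
As a quick check of the arithmetic, the sketch below converts a few hypothetical day counts to years, including the 330-day case from the exercise text. Dividing by 365.25 is shown only as an alternative convention that averages in leap years; the lab itself does not prescribe it.

```python
import pandas as pd

# Hypothetical day counts, including the 330-day case from the exercise text.
days = pd.Series([330, 365, 1461], name="Days in Term")

years_365 = days / 365        # the convention used in the cell above
years_36525 = days / 365.25   # alternative that averages in leap years

print(years_365.round(3).tolist())
print(years_36525.round(3).tolist())
```

For short reigns the two conventions agree to about two decimal places, so the choice matters little here.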

Exercise 3: It only makes sense to analyze people who are no longer heads of state. (If we included current heads of state, we would not get an accurate measure of their length of reign, because their reign is still continuing!) Subset your data accordingly.

In [None]:
state = state[state['Current?'] == 'No']
state['Current?'].value_counts()
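
Later cells add new columns to the filtered frame, which can trigger pandas' chained-assignment warning in some versions. A minimal sketch, on hypothetical toy data, of taking an explicit copy when subsetting (the names `toy`, `former`, and `flag` are made up for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same 'Current?' coding as the lab data.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Current?": ["No", "Yes", "No"],
})

# .copy() makes the subset an independent frame, so columns added later
# belong to the subset only and never touch the original.
former = toy[toy["Current?"] == "No"].copy()
former["flag"] = 1

print(former["Current?"].value_counts())
```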

Exercise 4: Does being royal have a significant effect on the length of one's reign?
- Build the model using `sm.OLS()`. Be sure to include a $y$-intercept!
- Check out the summary. Interpret the coefficient.
- Based on the $p$-value in the summary, mention what (if anything) you can conclude about the effect of being royal on the length of one's reign.

In [None]:
def valer(row):
    if row == 'Yes':
        val = 1
    else: 
        val = 0
    return val
state['Royal_dummie'] = state['Royal?'].apply(valer)
state['Royal_dummie'].value_counts()
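
The same 0/1 encoding can also be written without a helper function. The sketch below uses a hypothetical series and assumes, as `valer` does, that anything other than the exact string 'Yes' (including a missing value) should map to 0.

```python
import pandas as pd

# Hypothetical values, including one missing entry.
royal = pd.Series(["Yes", "No", None, "Yes"], name="Royal?")

# The comparison yields booleans; a missing value compares as False,
# matching the else-branch of valer(), which also maps it to 0.
royal_dummy = (royal == "Yes").astype(int)
print(royal_dummy.tolist())
```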

In [None]:
new_state = state.drop(['Name', 'Wikipedia Page', 'Description', 'Image', 'Birth Year',
       'Term Began', 'Age Term Began (approx.)',
       'Term Ended', 'Year Term Ended', 'Days in Term',
       'Royal?', 'Current?', 'Birth Place (current city)',
       'Country of Birth (current country)', 'Ruler of',
       'Country of Ruled Territory (Current)', 'Political Party', 'Studies',
       'Role', 'Religion'], axis=1)
new_state = new_state.dropna()
new_state.sort_values('Year Term Began')


In [None]:
import statsmodels.api as sm
dep = new_state['Term length']
indep = new_state.drop(['Term length', 'Year Term Began'], axis='columns')
indep = sm.add_constant(indep)
model = sm.OLS(dep, indep)
results = model.fit()
results.summary()
# We can reject the null hypothesis of zero effect at the 1% significance
# level because the p-value is below 0.01. Interpreting the coefficient:
# being royal (versus not royal) is associated with a 22-year longer reign on average.

Exercise 5: Does having a religion listed (column V) have a significant effect on the length of one's reign?
- Build the model using `sm.OLS()`. Be sure to include a $y$-intercept!
- Check out the summary. Interpret the coefficient.
- Based on the $p$-value in the summary, mention what (if anything) you can conclude about the effect of having a religion listed on the length of one's reign.

In [None]:
def religioner(row):
    if row != 0:
        val = 1
    else:
        val = 0
    return val 
state['Religion'] = state['Religion'].fillna(value=0)
state['Religion'] = state['Religion'].replace(to_replace='None', value=0)

In [None]:
state['Religion'] = state['Religion'].apply(religioner)
state['Religion'].value_counts()
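
An alternative sketch that builds the "religion listed" flag in one step and leaves the original text column intact. The series below is hypothetical, and the variable name `religion_listed` is made up for illustration. It assumes, as the cells above do, that both missing cells and the literal string 'None' mean no religion was listed.

```python
import pandas as pd

# Hypothetical values: two listed religions, one missing cell, one 'None' string.
religion = pd.Series(["Catholic", None, "None", "Sunni Islam"], name="Religion")

# A religion counts as listed when the cell is neither missing nor 'None'.
religion_listed = (religion.notna() & (religion != "None")).astype(int)
print(religion_listed.tolist())
```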

In [None]:
new_state2 = state.drop(['Name', 'Wikipedia Page', 'Description', 'Image', 'Birth Year',
       'Age Term Began (approx.)', 'Term Began', 'Year Term Began',
       'Term Ended', 'Year Term Ended', 'Days in Term',
       'Royal?', 'Current?', 'Birth Place (current city)',
       'Country of Birth (current country)', 'Ruler of',
       'Country of Ruled Territory (Current)', 'Political Party', 'Studies',
       'Role', 'Royal_dummie', 'Relgion_dummie'],
       axis=1, errors='ignore')  # 'Relgion_dummie' is never created above; skip it if absent
new_state2 = new_state2.dropna()

In [None]:
dep = new_state2['Term length']
indep = new_state2.drop('Term length', axis='columns')
indep = sm.add_constant(indep)
model = sm.OLS(dep, indep)
results = model.fit()
results.summary()
# Religion is significant at the 1% level, so we can reject the null hypothesis of no effect.
# Having a recorded religion is associated with 6 fewer years of reign on average.

Exercise 6: Is there a significant interaction between being royal and having a religion listed?
- Build the model using `sm.OLS()`. Be sure to include a $y$-intercept!
- Based on the $p$-value in the summary, mention what (if anything) you can conclude about the interaction effect of royal and religion on the length of one's reign.

In [None]:
new_state3 = state.drop(['Name', 'Wikipedia Page', 'Description', 'Image', 'Birth Year',
       'Age Term Began (approx.)', 'Term Began', 'Year Term Began',
       'Term Ended', 'Year Term Ended', 'Days in Term',
       'Royal?', 'Current?', 'Birth Place (current city)',
       'Country of Birth (current country)', 'Ruler of',
       'Country of Ruled Territory (Current)', 'Political Party', 'Studies',
       'Role', 'Relgion_dummie'],
       axis=1, errors='ignore')  # 'Relgion_dummie' is never created above; skip it if absent
new_state3['Interact'] = new_state3['Religion'] * new_state3['Royal_dummie']
new_state3 = new_state3.dropna()
new_state3.head()

In [None]:
dep = new_state3['Term length']
indep = new_state3.drop('Term length', axis=1)
indep = sm.add_constant(indep)
model = sm.OLS(dep, indep)
results = model.fit()
results.summary()
# Religion loses significance, and the interaction term is not significant (we cannot reject the null hypotheses at the 10% level).
# The p-values are much larger than 0.05.

Exercise 7: Does the age the term began have a significant effect on the length of one's reign?

In [None]:
new_state4 = state.drop(['Name', 'Wikipedia Page', 'Description', 'Image', 'Birth Year',
     'Term Began', 'Year Term Began',
       'Term Ended', 'Year Term Ended', 'Days in Term',
       'Royal?', 'Current?', 'Birth Place (current city)',
       'Country of Birth (current country)', 'Ruler of',
       'Country of Ruled Territory (Current)', 'Political Party', 'Studies',
       'Role', 'Religion', 'Royal_dummie', 'Relgion_dummie'],
       axis=1, errors='ignore')  # 'Relgion_dummie' is never created above; skip it if absent
new_state4 = new_state4.dropna()

In [None]:
dep = new_state4['Term length']
indep = new_state4.drop('Term length', axis=1)
indep = sm.add_constant(indep)
model = sm.OLS(dep, indep)
results = model.fit()
results.summary()
# The null hypothesis can be rejected at the 1% level (p-value < 0.01). Each additional year of age at the start of the term is
# associated with a ~1/2 year reduction in the length of reign.

Exercise 8: Suppose you're wary of the data collection process here. Are there any concerns you might have about the data? (Perhaps about sampled versus target populations?)

In [None]:
# The spread of democracy is a relatively recent and widespread phenomenon that is not accounted for by ancient monarchies.
# I also question the term length (which has a mean of ~56 years); this seems far too long.

Exercise 9: Build a multiple linear regression model to predict the length of an individual's reign. In addition to the previous features, engineer at least two more features. If you want to use some of the text features, you may find [.isin()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html) helpful.

Discuss the results of your model.

In [None]:
stop = list(state.Description.values)
stop = [len(str(x)) for x in stop]
state['Decription Length'] = stop
state.head()

In [None]:
def europe(lst):
    europe = ['Albania', 'Andorra', 'Armenia', 'Austria', 'Azerbaijan', 'Belarus', 'Belgium', 'Bosnia and Herzegovina', 'Bulgaria', 'Croatia', 'Cyprus', 'Czech Republic', 'Denmark', 'Estonia', 'Finland', 'France', 'Georgia', 'Germany', 'Greece', 'Hungary', 'Iceland', 'Ireland', 'Italy', 'Kazakhstan', 'Kosovo', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg', 'Macedonia', 'Malta', 'Moldova', 'Monaco', 'Montenegro', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania', 'Russia', 'San Marino', 'Serbia', 'Slovakia', 'Slovenia', 'Spain', 'Sweden', 'Switzerland', 'Turkey', 'Ukraine', 'United Kingdom', 'Vatican City']
    if lst in europe:
        val = 1
    else:
        val = 0
    return val
state['Europe'] = state['Country of Ruled Territory (Current)'].apply(europe)
state.head()

In [None]:
state['Europe'].value_counts()

In [None]:
new_state5 = state.drop(['Name', 'Wikipedia Page', 'Description', 'Image', 'Birth Year',
     'Term Began', 'Religion', 'Age Term Began (approx.)',
       'Term Ended', 'Year Term Ended', 'Days in Term',
       'Royal?', 'Current?', 'Birth Place (current city)',
       'Country of Birth (current country)', 'Ruler of',
       'Country of Ruled Territory (Current)', 'Political Party', 'Studies',
       'Role'], axis=1)
new_state5 = new_state5.dropna()


In [205]:
dep = new_state5['Term length']
indep = new_state5.drop('Term length', axis=1)
indep = sm.add_constant(indep)
model = sm.OLS(dep, indep)
results = model.fit()
results.summary()
# I don't know why add_constant doesn't work, so I added the constant manually. Neither Europe nor the length of the
# description was significant at the 5% level.

0,1,2,3
Dep. Variable:,Term length,R-squared:,0.387
Model:,OLS,Adj. R-squared:,0.38
Method:,Least Squares,F-statistic:,55.3
Date:,"Tue, 30 Jan 2018",Prob (F-statistic):,3.9e-36
Time:,16:06:00,Log-Likelihood:,-1215.8
No. Observations:,356,AIC:,2442.0
Df Residuals:,351,BIC:,2461.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Year Term Began,7.202e-05,0.001,0.135,0.893,-0.001,0.001
Royal_dummie,23.0356,1.651,13.950,0.000,19.788,26.283
Relgion_dummie,17.4583,0.989,17.646,0.000,15.512,19.404
Decription Length,0.0009,0.002,0.602,0.547,-0.002,0.004
Europe,0.2293,0.840,0.273,0.785,-1.423,1.882
const,17.4583,0.989,17.646,0.000,15.512,19.404

0,1,2,3
Omnibus:,72.379,Durbin-Watson:,1.931
Prob(Omnibus):,0.0,Jarque-Bera (JB):,118.188
Skew:,1.202,Prob(JB):,2.17e-26
Kurtosis:,4.48,Cond. No.,1.62e+20
