# Question 1

For this question use the World Bank Data for Turkey for the following indicators. Use [wbgapi](https://pypi.org/project/wbgapi/) for getting the data.

* [Literacy rate, adult female (SE.ADT.LITR.FE.ZS)](https://data.worldbank.org/indicator/SE.ADT.LITR.FE.ZS)
* [Labor force, female (SL.TLF.TOTL.FE.ZS)](https://data.worldbank.org/indicator/SL.TLF.TOTL.FE.ZS)
* [Poverty headcount ratio at national poverty lines (SI.POV.NAHC)](https://data.worldbank.org/indicator/SI.POV.NAHC)
* [Current health expenditure per capita (SH.XPD.CHEX.PC.CD)](https://data.worldbank.org/indicator/SH.XPD.CHEX.PC.CD)
* [GDP per capita (NY.GDP.PCAP.CD)](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)
* [Mortality rate, under-5 (SH.DYN.MORT)](https://data.worldbank.org/indicator/SH.DYN.MORT)


Using the [statsmodels](https://www.statsmodels.org/stable/index.html) library, write the best linear regression model with child mortality as the dependent variable and the remaining indicators as independent variables. Pay particular attention to the order in which the variables enter the model: it changes the sequential ANOVA table and therefore how each term is judged. Choose the best model by considering

* the minimum number of variables and interactions,
* the optimal ordering of the independent variables and their interactions,
* the $R^2$-score of the model,
* the statistical significance of the model coefficients,
* the ANOVA analysis of the model.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
import statsmodels.api as sm 
import wbgapi as wb 
import numpy as np
from statsmodels.formula.api import ols

literacy=wb.data.DataFrame('SE.ADT.LITR.FE.ZS','TUR')
labor=wb.data.DataFrame('SL.TLF.TOTL.FE.ZS','TUR')
poverty=wb.data.DataFrame('SI.POV.NAHC','TUR')
health=wb.data.DataFrame('SH.XPD.CHEX.PC.CD','TUR')
mortality=wb.data.DataFrame('SH.DYN.MORT','TUR')
gdp=wb.data.DataFrame('NY.GDP.PCAP.CD','TUR')


ltr=literacy.transpose()
lbr=labor.transpose()
pvt=poverty.transpose()
hth=health.transpose()
mtr=mortality.transpose()
gd=gdp.transpose()

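def extract(df,name):
    # Copy the single 'TUR' column and give it a readable name.
    # A flat list keeps the columns a plain Index (a nested list would build a MultiIndex),
    # so the formula API can look the column up by name.
    tmp = df[['TUR']].copy()
    tmp.columns = [name]
    return tmp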

def lit_lab_pov_ht_gdp():
    lit = extract(ltr,'literacy')
    lr = extract(lbr,'labor')
    pt = extract(pvt,'poverty')
    hh = extract(hth,'health')
    gp = extract(gd,'gdp')
    mt = extract(mtr,'mortality')
    res = lit.join([lr,pt,hh,gp,mt])
    res.dropna(inplace=True)
    return res

res=lit_lab_pov_ht_gdp()





model = ols('mortality ~ literacy*poverty*gdp', data=res).fit()
print(sm.stats.anova_lm(model))
print(model.summary())

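# Adjusted R-squared is 0.984.

# literacy*gdp already expands to literacy + gdp + literacy:gdp,
# so only the interaction term literacy:gdp needs to be added explicitly.
model = ols('mortality ~ literacy+poverty+gdp+literacy:gdp', data=res).fit()
print(sm.stats.anova_lm(model))
print(model.summary())

# Adjusted R-squared is 0.985 with fewer terms, so this model is better.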



# Question 2

For this question use Yahoo's Finance API for the following tickers:

* Gold futures (GC=F)
* Silver futures (SI=F)
* Copper futures (HG=F)
* Platinum futures (PL=F)

1. Write the best linear regression model that explains gold futures closing prices in terms of opening prices of gold, silver, copper, and platinum futures.
2. Repeat the same for silver, copper and platinum prices.
3. Compare the models you obtained in Steps 1 and 2. Which model is better? How do you decide? Explain.

In [None]:
import yfinance as yf
import pandas as pd
import wbgapi as wb
import numpy as np
import statsmodels.api as sm

import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
gold=yf.download('GC=F')
silver=yf.download('SI=F')
copper=yf.download('HG=F')
plat=yf.download('PL=F')

data=pd.DataFrame()

data['gold']=gold['Open']
data['silver']=silver['Open']
data['copper']=copper['Open']
data['plat']=plat['Open']
data['gold_close']=gold['Close']
data['silver_close']=silver['Close']
data['copper_close']=copper['Close']
data['plat_close']=plat['Close']

model = ols('gold_close ~ gold+silver+copper+plat', data=data).fit()
print(model.summary())

model = ols('silver_close ~ gold+silver+copper+plat', data=data).fit()
print(model.summary())

model = ols('copper_close ~ gold+silver+copper+plat', data=data).fit()
print(model.summary())

model = ols('plat_close ~ gold+silver+copper+plat', data=data).fit()
print(model.summary())

# The first (gold) model is better: its R-squared is 1.000, while the other three reach 0.999.
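
Because the four models have different dependent variables and all reach an in-sample $R^2$ near 1, a complementary check is an out-of-sample error on a scale-free basis. The sketch below is one possible protocol, not the required answer: `holdout_scores` is a hypothetical helper, the chronological 80/20 split is an assumption, and the price series are synthetic random walks rather than downloaded futures data.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

def holdout_scores(formula, data, target, train_frac=0.8):
    """Fit on the earliest rows, score on the latest rows (chronological split)."""
    clean = data.dropna()  # drops rows with a missing value in any column
    cut = int(len(clean) * train_frac)
    train, test = clean.iloc[:cut], clean.iloc[cut:]
    fit = ols(formula, data=train).fit()
    errors = test[target] - fit.predict(test)
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return {
        "r2_train": float(fit.rsquared),
        "rmse_test": rmse,
        # Dividing by the mean price makes errors comparable across metals.
        "rel_rmse_test": rmse / float(test[target].mean()),
    }

# Synthetic random-walk "opening prices" and noisy "closing prices".
rng = np.random.default_rng(2)
n = 500
gold = 1000 + np.cumsum(rng.normal(scale=2.0, size=n))
silver = 100 + np.cumsum(rng.normal(scale=0.2, size=n))
demo = pd.DataFrame({
    "gold": gold,
    "silver": silver,
    "gold_close": gold + rng.normal(scale=0.5, size=n),
    "silver_close": silver + rng.normal(scale=0.05, size=n),
})

gold_scores = holdout_scores("gold_close ~ gold + silver", demo, "gold_close")
silver_scores = holdout_scores("silver_close ~ gold + silver", demo, "silver_close")
print(gold_scores)
print(silver_scores)
```

Applied to the `data` frame above, the same helper could be called once per closing-price column with the formulas already used for the four models.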

# Question 3

1. Write a function that takes a ticker symbol and returns a pandas DataFrame containing, for each day, a 1 when the closing price is higher than the opening price and a 0 when it is lower.
2. Write the best logistic regression model that predicts the time series from Step 1 for gold futures from the opening prices of gold, silver, copper, and platinum.
3. Repeat the same for silver, copper, and platinum.
4. Compare the models you obtained in Steps 2 and 3. Decide which is the best model, and explain your reasoning.
5. Do any of the models provide a good fit? Explain.
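
A possible starting point for Steps 1 and 2 is sketched below. The helper names `up_down_labels` and `up_down_for_ticker` are hypothetical, dropping days where the close equals the open is an assumption (the question does not say how to treat them), and the logistic fit runs on synthetic data so that the example works offline. The download uses `yf.Ticker(...).history(period="max")`, which returns flat `Open`/`Close` columns.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def up_down_labels(prices):
    """1 when Close > Open, 0 when Close < Open; ties and missing days are dropped."""
    diff = (prices["Close"] - prices["Open"]).dropna()
    diff = diff[diff != 0]
    return (diff > 0).astype(int).to_frame(name="up")

def up_down_for_ticker(ticker):
    """Download daily prices for a ticker and label each day (needs network access)."""
    import yfinance as yf  # imported lazily so the offline demo below still runs
    prices = yf.Ticker(ticker).history(period="max")
    return up_down_labels(prices)

# Offline check of the labelling rule on a tiny hand-made frame.
toy = pd.DataFrame({"Open": [1.0, 2.0, 3.0, 4.0], "Close": [2.0, 1.0, 3.0, 5.0]})
labels = up_down_labels(toy)
print(labels["up"].tolist())  # [1, 0, 1]

# Logistic regression on synthetic predictors (illustrative, not market data).
rng = np.random.default_rng(3)
n = 400
demo = pd.DataFrame({"gold": rng.normal(size=n), "silver": rng.normal(size=n)})
p_up = 1.0 / (1.0 + np.exp(-0.8 * demo["gold"]))
demo["up"] = (rng.uniform(size=n) < p_up).astype(int)

logit_fit = smf.logit("up ~ gold + silver", data=demo).fit(disp=0)
print(logit_fit.params)
print("McFadden pseudo R^2:", logit_fit.prsquared)
```

For the actual question, the labels from `up_down_for_ticker('GC=F')` would be joined with the four opening-price columns on the date index before fitting. The pseudo $R^2$ and the coefficient p-values in `logit_fit.summary()` are natural inputs for Steps 4 and 5.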

# Question 4

For this question use the following [data](https://archive.ics.uci.edu/ml/datasets/credit+approval):


In [3]:
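import pandas as pd

credit = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data', header=None)

fn = {'+': 1, '-': 0}

# '?' marks a missing value. Replace it with 0, then convert the selected columns to numbers,
# since columns containing '?' are read as strings.
X = credit.iloc[:,[1,2,7,10,14]].replace('?',0).apply(pd.to_numeric)
y = credit.iloc[:,15].map(lambda x: fn.get(x,0))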

1. Split the data into training and test sets.
2. Write different logistic regression models predicting y against X.
3. Construct [confusion matrices](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) on the test data set for these different models.
4. Analyze these models. Explain which model is the best model you have found.
5. Repeat Steps 1-4 several times. Does your best model stay as the best model? What should be the correct protocol to decide on the best model explaining the data?
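
The workflow in Steps 1 to 5 can be sketched as follows. This is a minimal example under stated assumptions: the data is synthetic (`X_demo`, `y_demo` stand in for `X` and `y`), the two candidate models differ only in regularization strength `C`, and repeated stratified cross-validation is offered as one reasonable protocol for Step 5, not the only one.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit data: 5 numeric features, binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
signal = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=1.0, size=300)
y_demo = (signal > 0).astype(int)

# Two candidate models that differ in regularization strength.
models = {
    "weak_penalty": make_pipeline(StandardScaler(), LogisticRegression(C=10.0, max_iter=1000)),
    "strong_penalty": make_pipeline(StandardScaler(), LogisticRegression(C=0.1, max_iter=1000)),
}

# Steps 1-3: a single split and one confusion matrix per model.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
matrices = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    matrices[name] = confusion_matrix(y_te, model.predict(X_te))
    print(name, matrices[name], sep="\n")

# Step 5: repeat over many splits instead of trusting a single one.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    for name, model in models.items()
}
for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the mean and spread of the cross-validated scores, rather than one confusion matrix from one split, shows whether an apparent winner is stable. A final untouched test set can then be used once to report the chosen model's performance.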