In [46]:
import pandas as pd
import numpy as np
import wbgapi as wb
import yfinance as yf

from statsmodels.formula.api import ols
import statsmodels.api as sm

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Overall Rules

- Refrain from saving datasets locally. You may experiment with your answers on a locally saved version of the datasets, but do not upload your local files with your homework as the datasets are very large. In your submitted answers datasets should be read from the original source URL.
- Document all of your steps by writing appropriate markdown cells in your notebook. Refrain from using code comments to explain what has been done.
- Avoid duplicating code. Do not copy and paste code from one cell to another. If copying and pasting is necessary, write a suitable function for the task at hand and call that function.
- Document your use of LLMs (ChatGPT, Claude, Code Pilot etc). Either take screenshots of your steps and include them with this notebook, or give me a full log (both questions and answers) in a markdown file named HW2-LLM-LOG.md.

Failure to adhere to these guidelines will result in a 25-point deduction for each infraction.

# HW2

## Q1

There are 22 countries surrounding the Mediterranean Sea: Spain, France, Monaco, Italy, Slovenia, Croatia, Bosnia and Herzegovina, Montenegro, Albania, Greece, Turkey, Syria, Lebanon, Israel, Palestine, Egypt, Libya, Tunisia, Algeria, and Morocco, with two island countries Malta and Cyprus.

1. Get the following data for every country in the list above from the World Bank Data server (using the `wbgapi` library)

- Adult female literacy (SE.ADT.LITR.FE.ZS)
- Adult female workforce participation rate (SL.TLF.ACTI.ZS)
- Child mortality rate (SP.DYN.IMRT.IN)
- Gini index (SI.POV.GINI)
- Life expectancy (SP.DYN.LE00.IN)
- GDP (NY.GDP.PCAP.CD)

2. Write a function that does linear regression for Log(mortality) against the other variables (except mortality).
3. Analyze the regression results for Spain, France, Turkey, Syria, and Israel.
4. Analyze the results for 2 other countries of your choice.

## Answers

Let us start by getting the data:

In [69]:
countries = ['ESP','FRA','MCO','ITA','HRV','BIH','MNE','ALB','GRC','TUR','SYR','LBN',
             'ISR','PSE','EGY','LBY','TUN','DZA','MAR','MLT','CYP']

series = {'Literacy': 'SE.ADT.LITR.FE.ZS',
          'Participation': 'SL.TLF.ACTI.ZS',
          'Mortality': 'SP.DYN.IMRT.IN',
          'Expectancy': 'SP.DYN.LE00.IN',
          'GINI': 'SI.POV.GINI',
          'GDP': 'NY.GDP.PCAP.CD'}


data = wb.data.DataFrame(list(series.values()), economy=countries).T
data

economy,ALB,ALB,ALB,ALB,ALB,ALB,BIH,BIH,BIH,BIH,...,TUN,TUN,TUN,TUN,TUR,TUR,TUR,TUR,TUR,TUR
series,NY.GDP.PCAP.CD,SE.ADT.LITR.FE.ZS,SI.POV.GINI,SL.TLF.ACTI.ZS,SP.DYN.IMRT.IN,SP.DYN.LE00.IN,NY.GDP.PCAP.CD,SE.ADT.LITR.FE.ZS,SI.POV.GINI,SL.TLF.ACTI.ZS,...,SI.POV.GINI,SL.TLF.ACTI.ZS,SP.DYN.IMRT.IN,SP.DYN.LE00.IN,NY.GDP.PCAP.CD,SE.ADT.LITR.FE.ZS,SI.POV.GINI,SL.TLF.ACTI.ZS,SP.DYN.IMRT.IN,SP.DYN.LE00.IN
YR1960,,,,,,54.439,,,,,...,,,,43.940,275.041699,,,,171.5,50.740
YR1961,,,,,,55.634,,,,,...,,,,44.146,282.742464,,,,166.2,51.550
YR1962,,,,,,56.671,,,,,...,,,181.7,45.513,307.306286,,,,160.8,52.382
YR1963,,,,,,57.844,,,,,...,,,173.2,46.500,347.177091,,,,155.5,53.173
YR1964,,,,,,58.983,,,,,...,,,165.0,47.365,365.133869,,,,150.4,53.714
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
YR2019,5396.214243,,30.1,70.373,8.4,79.282,6094.724823,,,57.265,...,,51.891,14.5,75.993,9215.440875,94.424042,43.8,57.710,8.6,77.832
YR2020,5343.037704,,29.4,67.363,8.4,76.989,6095.104237,,,57.784,...,,49.715,14.3,75.292,8638.739133,,43.0,54.048,8.1,75.850
YR2021,6377.203096,,,68.684,8.4,76.463,7230.198838,,,60.999,...,33.7,50.037,14.0,73.772,9743.213131,,44.4,56.321,7.7,76.032
YR2022,6810.114041,98.300003,,71.367,,,7568.798480,97.099998,,61.552,...,,52.348,,,10674.504173,,,58.314,,


For the models, I am going to rename the columns, interpolate the missing values and then backfill the rest.

In [70]:
data.rename(columns={v: k for (k,v) in series.items()},inplace=True)
data.interpolate(inplace=True)
data.bfill(inplace=True)
data

economy,ALB,ALB,ALB,ALB,ALB,ALB,BIH,BIH,BIH,BIH,...,TUN,TUN,TUN,TUN,TUR,TUR,TUR,TUR,TUR,TUR
series,GDP,Literacy,GINI,Participation,Mortality,Expectancy,GDP,Literacy,GINI,Participation,...,GINI,Participation,Mortality,Expectancy,GDP,Literacy,GINI,Participation,Mortality,Expectancy
YR1960,639.484730,98.252274,27.0,69.274,76.7,54.439,333.783179,81.959328,30.0,49.564,...,43.40,50.628,181.7,43.940,275.041699,45.098919,43.5,60.068,171.5,50.740
YR1961,639.484730,98.252274,27.0,69.274,76.7,55.634,333.783179,81.959328,30.0,49.564,...,43.40,50.628,181.7,44.146,282.742464,45.098919,43.5,60.068,166.2,51.550
YR1962,639.484730,98.252274,27.0,69.274,76.7,56.671,333.783179,81.959328,30.0,49.564,...,43.40,50.628,181.7,45.513,307.306286,45.098919,43.5,60.068,160.8,52.382
YR1963,639.484730,98.252274,27.0,69.274,76.7,57.844,333.783179,81.959328,30.0,49.564,...,43.40,50.628,173.2,46.500,347.177091,45.098919,43.5,60.068,155.5,53.173
YR1964,639.484730,98.252274,27.0,69.274,76.7,58.983,333.783179,81.959328,30.0,49.564,...,43.40,50.628,165.0,47.365,365.133869,45.098919,43.5,60.068,150.4,53.714
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
YR2019,5396.214243,97.650823,30.1,70.373,8.4,79.282,6094.724823,96.370163,33.0,57.265,...,33.40,51.891,14.5,75.993,9215.440875,94.424042,43.8,57.710,8.6,77.832
YR2020,5343.037704,97.867216,29.4,67.363,8.4,76.989,6095.104237,96.613441,33.0,57.784,...,33.55,49.715,14.3,75.292,8638.739133,94.424042,43.0,54.048,8.1,75.850
YR2021,6377.203096,98.083610,29.4,68.684,8.4,76.463,7230.198838,96.856720,33.0,60.999,...,33.70,50.037,14.0,73.772,9743.213131,94.424042,44.4,56.321,7.7,76.032
YR2022,6810.114041,98.300003,29.4,71.367,8.4,76.463,7568.798480,97.099998,33.0,61.552,...,33.70,52.348,14.0,73.772,10674.504173,94.424042,44.4,58.314,7.7,76.032


In [71]:
def experiment(data,country, formula):
    tmp = data[country]
    tmp['Mortality'] = np.log(tmp['Mortality'])
    res = ols(formula, data=tmp).fit()
    return res

### Turkey

Let us fit a model using the same order given in the original data frame.

In [72]:
res = experiment(data,'TUR','Mortality ~ GDP + Literacy + GINI + Participation + Expectancy')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.981
Model:,OLS,Adj. R-squared:,0.98
Method:,Least Squares,F-statistic:,605.2
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,1.0999999999999999e-48
Time:,21:35:11,Log-Likelihood:,37.721
No. Observations:,64,AIC:,-63.44
Df Residuals:,58,BIC:,-50.49
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,14.0203,3.204,4.376,0.000,7.607,20.433
GDP,-7.343e-05,1.08e-05,-6.805,0.000,-9.5e-05,-5.18e-05
Literacy,-0.0271,0.005,-5.690,0.000,-0.037,-0.018
GINI,-0.1296,0.073,-1.766,0.083,-0.277,0.017
Participation,-0.0106,0.007,-1.533,0.131,-0.024,0.003
Expectancy,-0.0276,0.009,-3.015,0.004,-0.046,-0.009

0,1,2,3
Omnibus:,9.959,Durbin-Watson:,0.33
Prob(Omnibus):,0.007,Jarque-Bera (JB):,11.233
Skew:,-0.683,Prob(JB):,0.00364
Kurtosis:,4.532,Cond. No.,1080000.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
GDP,1.0,54.158242,54.158242,2724.721492,1.8731099999999997e-50
Literacy,1.0,5.732678,5.732678,288.413179,3.530806e-24
GINI,1.0,0.060456,0.060456,3.041582,0.08645478
Participation,1.0,0.013322,0.013322,0.670253,0.4163145
Expectancy,1.0,0.180722,0.180722,9.092188,0.003804982
Residual,58.0,1.152844,0.019877,,


When you change the order of the variables, their effect on the variation changes. Compare the previous model with this one.

In [73]:
res = experiment(data,'TUR','Mortality ~ Literacy + Participation + Expectancy + GDP + GINI')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.981
Model:,OLS,Adj. R-squared:,0.98
Method:,Least Squares,F-statistic:,605.2
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,1.0999999999999999e-48
Time:,21:35:12,Log-Likelihood:,37.721
No. Observations:,64,AIC:,-63.44
Df Residuals:,58,BIC:,-50.49
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,14.0203,3.204,4.376,0.000,7.607,20.433
Literacy,-0.0271,0.005,-5.690,0.000,-0.037,-0.018
Participation,-0.0106,0.007,-1.533,0.131,-0.024,0.003
Expectancy,-0.0276,0.009,-3.015,0.004,-0.046,-0.009
GDP,-7.343e-05,1.08e-05,-6.805,0.000,-9.5e-05,-5.18e-05
GINI,-0.1296,0.073,-1.766,0.083,-0.277,0.017

0,1,2,3
Omnibus:,9.959,Durbin-Watson:,0.33
Prob(Omnibus):,0.007,Jarque-Bera (JB):,11.233
Skew:,-0.683,Prob(JB):,0.00364
Kurtosis:,4.532,Cond. No.,1080000.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
Literacy,1.0,59.067875,59.067875,2971.726998,1.588988e-51
Participation,1.0,1.1e-05,1.1e-05,0.000535,0.9816314
Expectancy,1.0,0.038496,0.038496,1.936754,0.1693359
GDP,1.0,0.977064,0.977064,49.156449,2.807273e-09
GINI,1.0,0.061974,0.061974,3.11796,0.08269686
Residual,58.0,1.152844,0.019877,,


After a few tries you'll find that for Turkey the following model has the best $R^2$ score. This indicates that the factor that effects the child mortality in Turkey is mothers' literacy and GDP. The literacy has statistically verifiable negative effect, so does the GDP as you can see below.

In [74]:
res = experiment(data,'TUR','Mortality ~ Literacy + GDP')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.977
Model:,OLS,Adj. R-squared:,0.976
Method:,Least Squares,F-statistic:,1298.0
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,1.02e-50
Time:,21:35:13,Log-Likelihood:,31.338
No. Observations:,64,AIC:,-56.68
Df Residuals:,61,BIC:,-50.2
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.7094,0.127,52.677,0.000,6.455,6.964
Literacy,-0.0391,0.002,-15.763,0.000,-0.044,-0.034
GDP,-6.642e-05,1.11e-05,-5.973,0.000,-8.87e-05,-4.42e-05

0,1,2,3
Omnibus:,2.461,Durbin-Watson:,0.136
Prob(Omnibus):,0.292,Jarque-Bera (JB):,2.168
Skew:,-0.343,Prob(JB):,0.338
Kurtosis:,2.415,Cond. No.,39900.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
Literacy,1.0,59.067875,59.067875,2560.24075,1.586878e-51
GDP,1.0,0.823044,0.823044,35.674075,1.29861e-07
Residual,61.0,1.407344,0.023071,,


### Spain

Let us check what happens with Spain:

In [75]:
res = experiment(data,'ESP','Mortality ~ GDP + Literacy + GINI + Participation + Expectancy')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.993
Model:,OLS,Adj. R-squared:,0.992
Method:,Least Squares,F-statistic:,1597.0
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,9.39e-61
Time:,21:35:15,Log-Likelihood:,71.335
No. Observations:,64,AIC:,-130.7
Df Residuals:,58,BIC:,-117.7
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,19.1006,1.168,16.356,0.000,16.763,21.438
GDP,-9.168e-06,4.74e-06,-1.933,0.058,-1.87e-05,3.24e-07
Literacy,-0.0421,0.012,-3.447,0.001,-0.067,-0.018
GINI,0.0175,0.012,1.512,0.136,-0.006,0.041
Participation,0.0398,0.005,7.716,0.000,0.029,0.050
Expectancy,-0.2084,0.008,-26.033,0.000,-0.224,-0.192

0,1,2,3
Omnibus:,1.603,Durbin-Watson:,0.981
Prob(Omnibus):,0.449,Jarque-Bera (JB):,1.584
Skew:,0.357,Prob(JB):,0.453
Kurtosis:,2.711,Cond. No.,2050000.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
GDP,1.0,47.465493,47.465493,6827.219391,7.243122e-62
Literacy,1.0,2.977515,2.977515,428.272069,1.840223e-28
GINI,1.0,0.00205,0.00205,0.294799,0.5892424
Participation,1.0,0.369093,0.369093,53.088614,9.684905e-10
Expectancy,1.0,4.71179,4.71179,677.722331,1.0971740000000001e-33
Residual,58.0,0.403239,0.006952,,


In [76]:
res = experiment(data,'TUR','Mortality ~ Literacy + GDP')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.977
Model:,OLS,Adj. R-squared:,0.976
Method:,Least Squares,F-statistic:,1298.0
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,1.02e-50
Time:,21:35:15,Log-Likelihood:,31.338
No. Observations:,64,AIC:,-56.68
Df Residuals:,61,BIC:,-50.2
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.7094,0.127,52.677,0.000,6.455,6.964
Literacy,-0.0391,0.002,-15.763,0.000,-0.044,-0.034
GDP,-6.642e-05,1.11e-05,-5.973,0.000,-8.87e-05,-4.42e-05

0,1,2,3
Omnibus:,2.461,Durbin-Watson:,0.136
Prob(Omnibus):,0.292,Jarque-Bera (JB):,2.168
Skew:,-0.343,Prob(JB):,0.338
Kurtosis:,2.415,Cond. No.,39900.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
Literacy,1.0,59.067875,59.067875,2560.24075,1.586878e-51
GDP,1.0,0.823044,0.823044,35.674075,1.29861e-07
Residual,61.0,1.407344,0.023071,,


The same model works pretty well for Spain too. Let us try the others:

### France

In [77]:
res = experiment(data,'FRA','Mortality ~ GDP + Literacy + GINI + Participation + Expectancy')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


ValueError: zero-size array to reduction operation maximum which has no identity

We don't have enough data for France. 

### Israel

Let us move to Israel:


In [68]:
wb.data.DataFrame(list(series.values()), economy='ISR').T

series,NY.GDP.PCAP.CD,SE.ADT.LITR.FE.ZS,SI.POV.GINI,SL.TLF.ACTI.ZS,SP.DYN.IMRT.IN,SP.DYN.LE00.IN
YR1960,1229.174748,,,,34.4,
YR1961,1436.384439,,,,32.7,72.006585
YR1962,1094.635848,,,,31.0,72.112195
YR1963,1257.811405,,,,29.5,
YR1964,1375.892256,,,,28.2,
...,...,...,...,...,...,...
YR2019,44452.232562,,38.3,73.200,2.8,82.804878
YR2020,44846.791595,,37.8,71.513,2.8,82.648780
YR2021,52129.515961,,37.9,71.649,2.7,82.500000
YR2022,54930.938808,,,73.337,,


In [78]:
res = experiment(data,'ISR','Mortality ~ GDP + Literacy + GINI + Participation + Expectancy')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.982
Model:,OLS,Adj. R-squared:,0.981
Method:,Least Squares,F-statistic:,825.8
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,4.8e-51
Time:,21:35:23,Log-Likelihood:,50.75
No. Observations:,64,AIC:,-91.5
Df Residuals:,59,BIC:,-80.71
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.0018,0.000,12.102,0.000,0.002,0.002
GDP,-1.015e-05,4.06e-06,-2.498,0.015,-1.83e-05,-2.02e-06
Literacy,0.1618,0.013,12.102,0.000,0.135,0.188
GINI,-0.0116,0.014,-0.838,0.406,-0.039,0.016
Participation,0.0403,0.016,2.463,0.017,0.008,0.073
Expectancy,-0.1862,0.015,-12.109,0.000,-0.217,-0.155

0,1,2,3
Omnibus:,12.901,Durbin-Watson:,0.452
Prob(Omnibus):,0.002,Jarque-Bera (JB):,14.595
Skew:,0.9,Prob(JB):,0.000677
Kurtosis:,4.494,Cond. No.,2.75e+20


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
GDP,1.0,38.64012,38.64012,2971.331109,3.589046e-52
Literacy,1.0,0.19366,0.19366,14.891992,0.0002846171
GINI,1.0,2.344475,2.344475,180.284409,1.367763e-19
Participation,1.0,0.036959,0.036959,2.842042,0.0971096
Expectancy,1.0,1.870223,1.870223,143.815612,1.849653e-17
Residual,59.0,0.767254,0.013004,,


Israel is an anomaly: While for the other countries we looked (Turkey, Spain) female literacy was the most important factor reducing child mortality, for Israel, it was GDP and life expectancy.

In [93]:
res = experiment(data,'ISR','Mortality ~ GDP + Expectancy')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.981
Model:,OLS,Adj. R-squared:,0.98
Method:,Least Squares,F-statistic:,1546.0
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,5.56e-53
Time:,21:40:12,Log-Likelihood:,47.618
No. Observations:,64,AIC:,-89.24
Df Residuals:,61,BIC:,-82.76
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,16.4501,0.760,21.642,0.000,14.930,17.970
GDP,-4.943e-06,2.93e-06,-1.688,0.097,-1.08e-05,9.14e-07
Expectancy,-0.1845,0.011,-17.478,0.000,-0.206,-0.163

0,1,2,3
Omnibus:,17.574,Durbin-Watson:,0.367
Prob(Omnibus):,0.0,Jarque-Bera (JB):,23.971
Skew:,1.071,Prob(JB):,6.23e-06
Kurtosis:,5.097,Cond. No.,1220000.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
GDP,1.0,38.64012,38.64012,2785.604871,1.2812360000000001e-52
Expectancy,1.0,4.237413,4.237413,305.479361,1.971118e-25
Residual,61.0,0.846153,0.013871,,


### Italy

Let us check Italy:

In [94]:
res = experiment(data,'ITA','Mortality ~ GDP + Literacy + GINI + Participation + Expectancy')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.991
Model:,OLS,Adj. R-squared:,0.991
Method:,Least Squares,F-statistic:,1339.0
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,1.5100000000000002e-58
Time:,21:44:04,Log-Likelihood:,65.388
No. Observations:,64,AIC:,-118.8
Df Residuals:,58,BIC:,-105.8
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,23.6509,4.020,5.884,0.000,15.605,31.697
GDP,6.242e-06,3.99e-06,1.566,0.123,-1.74e-06,1.42e-05
Literacy,-0.0862,0.053,-1.619,0.111,-0.193,0.020
GINI,0.0350,0.012,2.837,0.006,0.010,0.060
Participation,0.0245,0.009,2.704,0.009,0.006,0.043
Expectancy,-0.2071,0.012,-16.954,0.000,-0.232,-0.183

0,1,2,3
Omnibus:,2.387,Durbin-Watson:,0.878
Prob(Omnibus):,0.303,Jarque-Bera (JB):,1.9
Skew:,-0.421,Prob(JB):,0.387
Kurtosis:,3.072,Cond. No.,7900000.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
GDP,1.0,51.800974,51.800974,6187.034067,1.2280460000000001e-60
Literacy,1.0,0.69094,0.69094,82.524843,9.628506e-13
GINI,1.0,0.564973,0.564973,67.479603,2.67809e-11
Participation,1.0,0.575313,0.575313,68.714513,2.008053e-11
Expectancy,1.0,2.406485,2.406485,287.427142,3.836137e-24
Residual,58.0,0.485605,0.008373,,


It seems Italy is a little different.

In [106]:
res = experiment(data,'ITA','Mortality ~ Literacy + Expectancy')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.99
Model:,OLS,Adj. R-squared:,0.989
Method:,Least Squares,F-statistic:,2968.0
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,1.69e-61
Time:,22:30:27,Log-Likelihood:,59.978
No. Observations:,64,AIC:,-114.0
Df Residuals:,61,BIC:,-107.5
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,13.8951,2.351,5.911,0.000,9.194,18.596
Literacy,0.0543,0.032,1.721,0.090,-0.009,0.117
Expectancy,-0.2210,0.010,-22.754,0.000,-0.240,-0.202

0,1,2,3
Omnibus:,0.48,Durbin-Watson:,0.758
Prob(Omnibus):,0.787,Jarque-Bera (JB):,0.15
Skew:,-0.103,Prob(JB):,0.928
Kurtosis:,3.117,Cond. No.,24000.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
Literacy,1.0,51.0683,51.0683,5417.26351,2.71519e-61
Expectancy,1.0,4.880946,4.880946,517.764844,1.6870040000000001e-31
Residual,61.0,0.575044,0.009427,,


### Greece

In [124]:
res = experiment(data,'GRC','Mortality ~ Expectancy + Literacy')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.987
Model:,OLS,Adj. R-squared:,0.986
Method:,Least Squares,F-statistic:,2265.0
Date:,"Thu, 28 Mar 2024",Prob (F-statistic):,5.8000000000000005e-58
Time:,22:39:26,Log-Likelihood:,57.487
No. Observations:,64,AIC:,-109.0
Df Residuals:,61,BIC:,-102.5
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,22.5501,0.349,64.560,0.000,21.852,23.249
Expectancy,-0.1784,0.008,-23.721,0.000,-0.193,-0.163
Literacy,-0.0744,0.008,-9.177,0.000,-0.091,-0.058

0,1,2,3
Omnibus:,4.575,Durbin-Watson:,0.445
Prob(Omnibus):,0.102,Jarque-Bera (JB):,4.31
Skew:,-0.634,Prob(JB):,0.116
Kurtosis:,2.918,Cond. No.,3260.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
Expectancy,1.0,45.30706,45.30706,4446.111527,1.044772e-58
Literacy,1.0,0.858219,0.858219,84.219506,4.282908e-13
Residual,61.0,0.621606,0.01019,,


### Egypt

In [149]:
res = experiment(data,'EGY','Mortality ~ Literacy + Expectancy')
res.summary()
sm.stats.anova_lm(res)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tmp['Mortality'] = np.log(tmp['Mortality'])


0,1,2,3
Dep. Variable:,Mortality,R-squared:,0.997
Model:,OLS,Adj. R-squared:,0.996
Method:,Least Squares,F-statistic:,8899.0
Date:,"Fri, 29 Mar 2024",Prob (F-statistic):,5.9e-76
Time:,18:49:35,Log-Likelihood:,101.27
No. Observations:,64,AIC:,-196.5
Df Residuals:,61,BIC:,-190.1
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7.7807,0.084,93.000,0.000,7.613,7.948
Literacy,-0.0309,0.001,-32.854,0.000,-0.033,-0.029
Expectancy,-0.0393,0.002,-20.522,0.000,-0.043,-0.035

0,1,2,3
Omnibus:,0.266,Durbin-Watson:,0.171
Prob(Omnibus):,0.875,Jarque-Bera (JB):,0.138
Skew:,-0.113,Prob(JB):,0.933
Kurtosis:,2.968,Cond. No.,1000.0


Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
Literacy,1.0,45.072382,45.072382,17377.23045,1.244424e-76
Expectancy,1.0,1.092332,1.092332,421.138441,4.484325e-29
Residual,61.0,0.158219,0.002594,,


Let us summarize

| Country | Variable 1 | Variable 2 | $R^2$ |
|---------|------------|------------|-------|
| Turkey  | Literacy   |    GDP     | %97.7 |
| Spain   | Literacy   |    GDP     | %97.7 |
| France  |            |            |       |
| Israel  |   GDP      | Expectancy | %98.1 |
| Italy   | Literacy   | Expectancy | %99.0 |
| Greece  | Expectancy | Literacy   | %98.7 |
| Egypt   | Literacy   | Expectancy | %99.7 |


The order of the variables is also important. Try the models with the order of the important variables reversed, and see that the ANOVA results show that the weight of the most important variable drops.  On the other hand, if you add interaction terms the ANOVA results show no additional drop in the variance (look at the residuals.)

## Q2

Get the following commodity price data from yahoo finance using the `yfinance` library:

- Silver (SI=F)
- Copper (HG=F)
- Platinum (PL=F)
- Gold (GC=F)
- Palladium (PA=F)

1. Write a linear regression model that relates the gold futures in terms of the other precious metals.
2. Analyze the regression results.
3. Does the model improve if we add interaction terms? Explain.
4. Now, do the same for each futures in the list above.

## Q3

Use the *Acoustic Extinguisher Fire Dataset* from Murat Köklü's [data server](https://www.muratkoklu.com/datasets/).

1. Explore the dataset, and project it to 2D space using PCA and LDA. Color the data points using the `STATUS` column.
2. Construct an SVM model to model the `STATUS` column and measure its quality using Accuracy, Precision, Recall, and F-1.
3. Construct a Logistic Regression model to model the `STATUS` column and measure its quality using Accuracy, Precision, Recall, and F-1.
4. Using the LR model, determine which variables affect the most the `STATUS` column.

## Q4

Use the hyperspectral image data (ROSIS sensor data over Pavia Italy) we used for Question 2 from HW1 for this question.

1. Load both the image data and the ground truth data. Reshape the image and name is as `vectors` and the ground truth data as `labels`. 
2. Remove all data points whose label is 0.
3. Write a function that construct a multi-label logistic regression model relating `vectors` to `labels`, and analyzes the accuracy using a correct statistical methodology. Analyze the accuracy results.
4. Now, run a model once over a single training and test set. Report the accuracy, precision, recall, and F1 per label basis. 
5. Repeat (3) and (4) for a multi-label SVM model.
6. Construct confusion matrices over a single run for both LR and SVM, and compare. Present your conclusions.