# Introduction to Python for Scientific Computing

This Jupyter notebook is intended to introduce students to the basics of statistical computing using Python. It assumes students have a basic understanding of data structures in Python. It also requires students to have taken [Introduction to Probability and Statistics](https://www.khanacademy.org/math/statistics-probability), because probability and statistics underpin the inference methods used here.

## Running an OLS Regression in Python
The California Housing dataset is available through the sklearn library (`fetch_california_housing`). Let's load it and perform OLS regression to predict the median house value (`MedHouseVal`).

In [60]:
# Load libraries
from IPython.display import HTML, display
# Widen the notebook container to use the full browser width
display(HTML("<style>.container { width:100% !important; }</style>"))
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing


In [61]:
# LOAD Data and process into dataframe 
# California Housing dataset
housing = fetch_california_housing()
# Convert to pandas DataFrame
df_housing = pd.DataFrame(data=housing.data, columns=housing.feature_names)
df_housing['MedHouseVal'] = housing.target
print(df_housing.head(2))

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127    1.02381       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137    0.97188      2401.0  2.109842     37.86   

   Longitude  MedHouseVal  
0    -122.23        4.526  
1    -122.22        3.585  


### Prepare the Data
Define the independent variables (features) and the dependent variable (`MedHouseVal`).

In [67]:
# Define the independent variables (X) and the dependent variable (y)
X = df_housing.drop('MedHouseVal', axis=1)
y = df_housing['MedHouseVal']

# Add a constant to the independent variables matrix (intercept)
X = sm.add_constant(X)
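
To illustrate what the added constant does, here is a minimal NumPy sketch on made-up data. The arrays `x` and `y_demo` and the coefficients 2.0 and 3.0 are illustrative assumptions, not part of the housing data. By default `sm.add_constant` prepends a column of ones, which lets the model estimate an intercept; the sketch builds the same kind of design matrix by hand and solves the least-squares problem with and without that column.

```python
import numpy as np

# Hypothetical noise-free data generated as y = 2.0 + 3.0 * x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 + 3.0 * x

# Design matrix with a leading column of ones (the "constant")
X_demo = np.column_stack([np.ones_like(x), x])

# Least-squares solution: beta[0] is the intercept, beta[1] the slope
beta, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

# Without the ones column the fitted line is forced through the origin,
# so the single slope estimate no longer matches the true slope
beta_no_const, *_ = np.linalg.lstsq(x[:, None], y_demo, rcond=None)

print("with constant:   ", beta)
print("without constant:", beta_no_const)
```

With the ones column the fit recovers both the intercept and the slope; without it, the slope has to absorb the missing intercept.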

### Fit the OLS Model
Next, fit the OLS model using statsmodels.

In [68]:
# Fit the OLS model
model = sm.OLS(y, X).fit()

# Save the regression summary to a text file and render it as HTML in the notebook
# print(model.summary())  # plain-text alternative
with open('model_summary.txt', 'w') as f:
    f.write(model.summary().as_text())
HTML(model.summary().as_html())


Dep. Variable:,MedHouseVal,R-squared:,0.606
Model:,OLS,Adj. R-squared:,0.606
Method:,Least Squares,F-statistic:,3970.0
Date:,"Tue, 02 Jul 2024",Prob (F-statistic):,0.0
Time:,14:37:54,Log-Likelihood:,-22624.0
No. Observations:,20640,AIC:,45270.0
Df Residuals:,20631,BIC:,45340.0
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-36.9419,0.659,-56.067,0.000,-38.233,-35.650
MedInc,0.4367,0.004,104.054,0.000,0.428,0.445
HouseAge,0.0094,0.000,21.143,0.000,0.009,0.010
AveRooms,-0.1073,0.006,-18.235,0.000,-0.119,-0.096
AveBedrms,0.6451,0.028,22.928,0.000,0.590,0.700
Population,-3.976e-06,4.75e-06,-0.837,0.402,-1.33e-05,5.33e-06
AveOccup,-0.0038,0.000,-7.769,0.000,-0.005,-0.003
Latitude,-0.4213,0.007,-58.541,0.000,-0.435,-0.407
Longitude,-0.4345,0.008,-57.682,0.000,-0.449,-0.420

Omnibus:,4393.65,Durbin-Watson:,0.885
Prob(Omnibus):,0.0,Jarque-Bera (JB):,14087.596
Skew:,1.082,Prob(JB):,0.0
Kurtosis:,6.42,Cond. No.,238000.0
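
The R-squared value in the summary is the share of the variance in the dependent variable that the fit explains, $R^2 = 1 - SS_{res} / SS_{tot}$. Below is a minimal sketch of that calculation on made-up numbers; the arrays `y_obs` and `y_fit` are hypothetical and are not taken from the housing model.

```python
import numpy as np

# Hypothetical observed values and fitted values from some model
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_fit = np.array([1.1, 1.9, 3.2, 3.8])

# Residual sum of squares: variation left unexplained by the fit
ss_res = np.sum((y_obs - y_fit) ** 2)

# Total sum of squares: variation of y around its mean
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)

r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 4))
```

For a fitted statsmodels result such as `model`, the same quantity is exposed as `model.rsquared`.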


## Running Logistic Regression with a Binary Dependent Variable

In [72]:
import pandas as pd
from sklearn.datasets import load_iris
import statsmodels.api as sm

# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Keep only two classes (setosa and versicolor)
df = df[df['target'] != 2]

# Define X (features) and y (target)
X = df.drop('target', axis=1)
y = df['target']

# Add constant to X (for intercept)
X = sm.add_constant(X)
X.head()

,const,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,1.0,5.1,3.5,1.4,0.2
1,1.0,4.9,3.0,1.4,0.2
2,1.0,4.7,3.2,1.3,0.2
3,1.0,4.6,3.1,1.5,0.2
4,1.0,5.0,3.6,1.4,0.2


In [73]:
# Fit logistic regression model
logit_model = sm.Logit(y, X)
result = logit_model.fit()

# Print model summary
print(result.summary())


         Current function value: 0.000000
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:                 target   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Tue, 02 Jul 2024   Pseudo R-squ.:                   1.000
Time:                        14:47:58   Log-Likelihood:            -8.9814e-06
converged:                      False   LL-Null:                       -69.315
Covariance Type:            nonrobust   LLR p-value:                 5.547e-29
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                 9.6813    1.2e+04      0.001      0.999   -2.35e+04    2.35e+04
sepal length (cm)    -4.1173   3316.583     
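
The output above reports `converged: False`, a pseudo R-squared of 1.000, a log-likelihood close to zero, and a very large standard error on the constant. That pattern is typical of perfect separation: the features split the two classes exactly, so the likelihood keeps improving as the coefficients grow and no finite maximum exists. Below is a minimal NumPy sketch of the effect on a hypothetical one-feature dataset; the values in `x_sep`, `y_sep`, and `slopes` are illustrative assumptions.

```python
import numpy as np

# Hypothetical perfectly separated data: every negative x has y = 0,
# every positive x has y = 1
x_sep = np.array([-2.0, -1.0, 1.0, 2.0])
y_sep = np.array([0, 0, 1, 1])

def log_likelihood(w):
    """Bernoulli log-likelihood of a no-intercept logistic model with slope w."""
    z = w * x_sep
    # log p(y | x) = y * z - log(1 + exp(z)); logaddexp keeps this stable
    return np.sum(y_sep * z - np.logaddexp(0.0, z))

slopes = [1.0, 5.0, 10.0, 20.0]
log_liks = [log_likelihood(w) for w in slopes]

for w, ll in zip(slopes, log_liks):
    print(f"slope={w:5.1f}  log-likelihood={ll:.6g}")
```

The log-likelihood rises toward zero as the slope grows, so an optimizer never settles on a finite estimate. In the iris data, setosa and versicolor are linearly separable, so one option (a suggestion, not part of the original notebook) is to compare two classes that overlap, such as versicolor and virginica.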



## Create a Sample Dataset

In [28]:
# Create pseudo-random data: 100 rows of integers in [0, 1000)
df = pd.DataFrame(np.random.randint(0, 1000, size=(100, 4)), columns=list('ABCD'))
df['b'] = 30  # constant column
df = df[['A', 'B', 'b', 'C', 'D']]
df['y'] = df['B'] - df['b']
# o is 60 when y lies in [-5, 5]; otherwise it is y + 50
df['o'] = df['y'].apply(lambda x: 60 if -5 <= x <= 5 else x + 50)
df

,A,B,b,C,D,y,o
0,375,766,30,826,257,736,786
1,305,495,30,667,713,465,515
2,57,258,30,888,538,228,278
3,912,208,30,795,416,178,228
4,893,175,30,622,555,145,195
...,...,...,...,...,...,...,...
95,304,935,30,955,499,905,955
96,727,497,30,964,52,467,517
97,577,479,30,27,586,449,499
98,780,187,30,676,872,157,207
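
The row-wise `apply` used to build column `o` can also be written as a vectorized expression. Below is a minimal sketch, assuming a small hypothetical frame `demo` that holds only a `y` column; its values are made up to cover both branches of the rule.

```python
import numpy as np
import pandas as pd

# Hypothetical y values on both sides of the [-5, 5] band
demo = pd.DataFrame({'y': [-10, -5, 0, 5, 6, 736]})

# Row-wise version: same rule as the apply call used for column o
o_apply = demo['y'].apply(lambda x: 60 if -5 <= x <= 5 else x + 50)

# Vectorized version: between() is inclusive on both ends by default
o_where = np.where(demo['y'].between(-5, 5), 60, demo['y'] + 50)

print(o_where.tolist())
```

Both versions give the same values; the vectorized form avoids calling a Python function once per row, which tends to matter on larger frames.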


In [30]:
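# The formula interface lives in statsmodels.formula.api, not statsmodels.api
import statsmodels.formula.api as smf

result = smf.ols(formula="o ~ b", data=df).fit()
print(result.params)
print(result.summary())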


Intercept     0.627159
b            18.814761
dtype: float64
                            OLS Regression Results                            
Dep. Variable:                      o   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                       nan
Date:                Tue, 02 Jul 2024   Prob (F-statistic):                nan
Time:                        12:10:59   Log-Likelihood:                -699.70
No. Observations:                 100   AIC:                             1401.
Df Residuals:                      99   BIC:                             1404.
Df Model:                           0                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------
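
The summary above shows `Df Model: 0`, an R-squared of 0.000, and a `nan` F-statistic. That is what one would expect when the only regressor is constant: `b` equals 30 in every row, so it is perfectly collinear with the intercept and the two coefficients are not separately identified. statsmodels' OLS fits with a pseudoinverse by default, which returns the minimum-norm solution; this is why the reported `b` coefficient is 30 times the intercept. Below is a minimal NumPy sketch of the same situation. It uses the first five `o` values from the table above as `o_demo`, and the explicit `rcond` cutoff is an assumption chosen for numerical safety.

```python
import numpy as np

# First five o values from the sample table above
o_demo = np.array([786.0, 515.0, 278.0, 228.0, 195.0])
n = o_demo.size

# Design matrix: an intercept column and a constant regressor b = 30
X_const = np.column_stack([np.ones(n), np.full(n, 30.0)])

# The two columns are proportional, so the matrix has rank 1, not 2
rank = np.linalg.matrix_rank(X_const)

# Minimum-norm least-squares solution via the pseudoinverse
coef = np.linalg.pinv(X_const, rcond=1e-10) @ o_demo

# Every fitted value equals the sample mean of o_demo
fitted = X_const @ coef

print("rank:", rank)
print("intercept, b:", coef)
print("fitted:", fitted)
```

Regressing `o` on a column that actually varies, such as `B`, would avoid the rank deficiency.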