# Time Series Analysis
## DSC 530 - Week Nine
### McKenzie Payne

> Blackboard Instruction

> Complete the following exercises:
- Page 142: 11-1 (Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth…)
- Page 142: 11-3 (If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called poisson…)
- Page 143: 11-4 (If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called mnlogit…)

## Code Provided by Author:

In [1]:
from os.path import basename, exists


def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve

        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)


download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkstats2.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/thinkplot.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py")

download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct")
download(
    "https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz"
)


import numpy as np
import random
import thinkstats2
import thinkplot

## Exercise 11-1:

#### Exercise: Suppose one of your co-workers is expecting a baby and you are participating in an office pool to predict the date of birth. Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction? You should limit yourself to variables that are known before the birth, and likely to be available to the people in the pool.

In [2]:
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/nsfg.py")
download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/first.py")

download("https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dct")
download(
    "https://github.com/AllenDowney/ThinkStats2/raw/master/code/2002FemPreg.dat.gz"
)

import first
live, firsts, others = first.MakeFrames()

> The code below imports the "first" module, calls the "MakeFrames" function to create three DataFrames (live_births, first_births, and other_births), and then filters the "live_births" DataFrame to include only rows where the "prglngth" column has values greater than 30.

In [3]:
import first
live_births, first_births, other_births = first.MakeFrames()
live_births = live_births[live_births['prglngth'] > 30]

> The code below uses the statsmodels library to fit an ordinary least squares (OLS) regression of "prglngth" on three indicator variables: "birthord" equal to 1, "race" equal to 2, and "nbrnaliv" greater than 1. It then displays a summary of the regression results. Note that the model is fit to the "live" DataFrame from In [2], not the filtered "live_births" DataFrame from In [3], so the 9,148 observations in the summary are all live births.

In [4]:
import statsmodels.formula.api as smf
model = smf.ols('prglngth ~ birthord==1 + race==2 + nbrnaliv>1', data=live)
results = model.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | prglngth | R-squared: | 0.025 |
| Model: | OLS | Adj. R-squared: | 0.024 |
| Method: | Least Squares | F-statistic: | 77.25 |
| Date: | Sun, 29 Oct 2023 | Prob (F-statistic): | 2.42e-49 |
| Time: | 10:59:52 | Log-Likelihood: | -21960.0 |
| No. Observations: | 9148 | AIC: | 43930.0 |
| Df Residuals: | 9144 | BIC: | 43960.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 38.3649 | 0.053 | 718.529 | 0.000 | 38.260 | 38.470 |
| birthord == 1[T.True] | 0.0287 | 0.056 | 0.513 | 0.608 | -0.081 | 0.138 |
| race == 2[T.True] | 0.3634 | 0.058 | 6.229 | 0.000 | 0.249 | 0.478 |
| nbrnaliv > 1[T.True] | -2.9210 | 0.211 | -13.833 | 0.000 | -3.335 | -2.507 |

| | | | |
|---|---|---|---|
| Omnibus: | 5996.608 | Durbin-Watson: | 1.65 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 132812.832 |
| Skew: | -2.805 | Prob(JB): | 0.0 |
| Kurtosis: | 20.804 | Cond. No. | 10.0 |


#### Assuming that bets are placed during the 30th week of pregnancy, what variables could you use to make the best prediction?

#### The predictors used to model the length of pregnancy ("prglngth") are as follows:
 1. birthord==1: True if this is a first birth. Its coefficient (about 0.03 weeks, p = 0.608) is not statistically significant.
 2. race==2: True if the race is coded as 2. These pregnancies are about 0.36 weeks longer on average.
 3. nbrnaliv>1: True if more than one baby was born alive, which indicates a multiple birth. These pregnancies are about 2.9 weeks shorter on average, the largest effect in the model.
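
As a rough illustration of how these coefficients turn into a prediction, the sketch below plugs the rounded values from the OLS summary into the fitted equation. The helper `predict_prglngth` and its argument names are made up for this example and are not part of the notebook.

```python
# Coefficients copied (rounded) from the OLS summary table.
INTERCEPT = 38.3649
COEF_FIRST = 0.0287      # birthord == 1
COEF_RACE2 = 0.3634      # race == 2
COEF_MULTIPLE = -2.9210  # nbrnaliv > 1


def predict_prglngth(first_birth, race_is_2, multiple_birth):
    """Hypothetical helper: predicted pregnancy length in weeks."""
    return (INTERCEPT
            + COEF_FIRST * first_birth
            + COEF_RACE2 * race_is_2
            + COEF_MULTIPLE * multiple_birth)


single = predict_prglngth(first_birth=True, race_is_2=True, multiple_birth=False)
twins = predict_prglngth(first_birth=True, race_is_2=True, multiple_birth=True)
print(f"single birth:   {single:.2f} weeks")
print(f"multiple birth: {twins:.2f} weeks")
```

With an R-squared of 0.025, these variables explain very little of the variation in pregnancy length, so such point predictions are only a small improvement over guessing the mean.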

## Exercise 11-3:

#### If the quantity you want to predict is a count, you can use Poisson regression, which is implemented in StatsModels with a function called poisson. It works the same way as ols and logit. As an exercise, let’s use it to predict how many children a woman has born; in the NSFG dataset, this variable is called numbabes.

> The code first imports the "nsfg" module and keeps only the rows of "live" where "prglngth" is greater than 30. It then reads the respondent file into the "resp" DataFrame, indexes "resp" by "caseid", and joins "live" to "resp" on "caseid" to produce the combined DataFrame "join". The last line displays the shape (number of rows and columns) of the result.

In [5]:
import nsfg
live = live[live.prglngth>30]
resp = nsfg.ReadFemResp()
resp.index = resp.caseid
join = live.join(resp, on='caseid', rsuffix='_r')
join.shape

(8884, 3331)
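
To make the join step concrete, here is a minimal sketch on toy data. The two small DataFrames and their values are invented for illustration; only the `join` call mirrors the cell above.

```python
import pandas as pd

# Toy stand-ins for the pregnancy and respondent files (values are made up).
preg = pd.DataFrame({"caseid": [1, 1, 2], "prglngth": [39, 40, 38]})
resp = pd.DataFrame({"caseid": [1, 2], "age_r": [30, 25]})

# Index the respondent table by caseid so join can look rows up by it.
resp.index = resp.caseid

# Left join: one output row per pregnancy. The overlapping 'caseid'
# column coming from resp is renamed with the '_r' suffix.
joined = preg.join(resp, on="caseid", rsuffix="_r")
print(joined)
```

Each pregnancy row picks up the attributes of its respondent, which is why the joined frame keeps the row count of the filtered "live" frame while gaining the respondent columns.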

> The code below replaces occurrences of the value 97 in the 'numbabes' column of the 'join' DataFrame with NaN and creates a new 'age2' column containing the square of the values in the 'age_r' column.

In [7]:
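# Assign the result back instead of using inplace=True on a column,
# which may not modify the DataFrame in recent pandas versions.
join['numbabes'] = join['numbabes'].replace(97, np.nan)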
join['age2'] = join.age_r**2

> The code below defines a Poisson regression model with specified independent variables, fits it to the data in the 'join' DataFrame, and displays a summary of the regression results.

In [8]:
formula = 'numbabes ~ age_r + age2 + C(race) + totincr + educat'
model = smf.poisson(formula, data=join)
results = model.fit()
results.summary()


Optimization terminated successfully.
         Current function value: 1.677002
         Iterations 7


| | | | |
|---|---|---|---|
| Dep. Variable: | numbabes | No. Observations: | 8884.0 |
| Model: | Poisson | Df Residuals: | 8877.0 |
| Method: | MLE | Df Model: | 6.0 |
| Date: | Sun, 29 Oct 2023 | Pseudo R-squ.: | 0.03686 |
| Time: | 11:02:45 | Log-Likelihood: | -14898.0 |
| converged: | True | LL-Null: | -15469.0 |
| Covariance Type: | nonrobust | LLR p-value: | 3.681e-243 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.0324 | 0.169 | -6.098 | 0.000 | -1.364 | -0.701 |
| C(race)[T.2] | -0.1401 | 0.015 | -9.479 | 0.000 | -0.169 | -0.111 |
| C(race)[T.3] | -0.0991 | 0.025 | -4.029 | 0.000 | -0.147 | -0.051 |
| age_r | 0.1556 | 0.010 | 15.006 | 0.000 | 0.135 | 0.176 |
| age2 | -0.0020 | 0.000 | -13.102 | 0.000 | -0.002 | -0.002 |
| totincr | -0.0187 | 0.002 | -9.830 | 0.000 | -0.022 | -0.015 |
| educat | -0.0471 | 0.003 | -16.076 | 0.000 | -0.053 | -0.041 |


#### Suppose you meet a woman who is 35 years old, black, and a college graduate whose annual household income exceeds $75,000. How many children would you predict she has born?

In [16]:
import pandas as pd
columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new = pd.DataFrame([[35, 35**2, 1, 14, 16]], columns=columns)
# `results` must be the Poisson fit from In [8]. If the mnlogit model
# from In [14] was fitted more recently, re-run In [8] first; otherwise
# predict() returns six marital-status probabilities instead of a count.
predicted_children = results.predict(new)
print(predicted_children)

0    2.496802
dtype: float64


> Based on the results, the Poisson model predicts that a woman who is 35 years old, black, a college graduate, and with an annual household income exceeding $75,000 has borne about 2.5 children (2.496802).
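
The prediction can be checked roughly by hand: Poisson regression models the log of the expected count as a linear function of the predictors, so the expected count is the exponential of that sum. The sketch below uses the rounded coefficients from the summary table, so it lands near, but not exactly on, the reported 2.496802 (the `age2` coefficient is rounded to four decimals and then multiplied by 1225, which magnifies the rounding error).

```python
import math

# Coefficients copied (rounded) from the Poisson summary table.
coef = {
    "Intercept": -1.0324,
    "age_r": 0.1556,
    "age2": -0.0020,
    "totincr": -0.0187,
    "educat": -0.0471,
}

# Same inputs as the prediction cell: age 35, totincr 14, educat 16.
# race is 1, the baseline category, so no C(race) term is added.
age, totincr, educat = 35, 14, 16

log_rate = (coef["Intercept"]
            + coef["age_r"] * age
            + coef["age2"] * age**2
            + coef["totincr"] * totincr
            + coef["educat"] * educat)

# Invert the log link to get the expected number of children.
expected_children = math.exp(log_rate)
print(round(expected_children, 2))
```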

## Exercise 11-4:

#### Exercise: If the quantity you want to predict is categorical, you can use multinomial logistic regression, which is implemented in StatsModels with a function called mnlogit. As an exercise, let’s use it to guess whether a woman is married, cohabitating, widowed, divorced, separated, or never married; in the NSFG dataset, marital status is encoded in a variable called rmarital. Suppose you meet a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000. What is the probability that she is married, cohabitating, etc?

> This code defines a multinomial logistic regression model, fits it to the data in the 'join' DataFrame using the specified formula, and displays a summary of the regression results.


In [14]:
formula = 'rmarital ~ age_r + age2 + C(race) + totincr + educat'
model = smf.mnlogit(formula, data=join)
results = model.fit()
results.summary()

Optimization terminated successfully.
         Current function value: 1.084053
         Iterations 8


| | | | |
|---|---|---|---|
| Dep. Variable: | rmarital | No. Observations: | 8884.0 |
| Model: | MNLogit | Df Residuals: | 8849.0 |
| Method: | MLE | Df Model: | 30.0 |
| Date: | Sun, 29 Oct 2023 | Pseudo R-squ.: | 0.1682 |
| Time: | 11:36:34 | Log-Likelihood: | -9630.7 |
| converged: | True | LL-Null: | -11579.0 |
| Covariance Type: | nonrobust | LLR p-value: | 0.0 |

| rmarital=2 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 9.0156 | 0.805 | 11.199 | 0.000 | 7.438 | 10.593 |
| C(race)[T.2] | -0.9237 | 0.089 | -10.418 | 0.000 | -1.097 | -0.750 |
| C(race)[T.3] | -0.6179 | 0.136 | -4.536 | 0.000 | -0.885 | -0.351 |
| age_r | -0.3635 | 0.051 | -7.150 | 0.000 | -0.463 | -0.264 |
| age2 | 0.0048 | 0.001 | 6.103 | 0.000 | 0.003 | 0.006 |
| totincr | -0.1310 | 0.012 | -11.337 | 0.000 | -0.154 | -0.108 |
| educat | -0.1953 | 0.019 | -10.424 | 0.000 | -0.232 | -0.159 |

| rmarital=3 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.9570 | 3.020 | 0.979 | 0.328 | -2.963 | 8.877 |
| C(race)[T.2] | -0.4411 | 0.237 | -1.863 | 0.062 | -0.905 | 0.023 |


#### Make a prediction for a woman who is 25 years old, white, and a high school graduate whose annual household income is about $45,000.

In [17]:
columns = ['age_r', 'age2', 'race', 'totincr', 'educat']
new = pd.DataFrame([[25, 25**2, 2, 11, 12]], columns=columns)
results.predict(new)

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 0.750028 | 0.126397 | 0.001564 | 0.033403 | 0.021485 | 0.067122 |


> The `results.predict(new)` call returns the predicted probability of each marital status for a woman with the specified characteristics (25 years old, white, high school graduate, and an annual household income of about $45,000), based on the fitted multinomial logistic regression model.

#### According to the model, a woman in this scenario has about a 75% chance of being currently married, a 13% chance of being "not married but living with an opposite-sex partner," a 0.2% chance of being widowed, a 3% chance of being divorced, a 2% chance of being separated, and a 7% chance of never having married.
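
To read the prediction more easily, the probabilities can be paired with status labels. This is a small sketch that assumes output columns 0 through 5 correspond to `rmarital` codes 1 through 6 in the order listed in the exercise (married, cohabitating, widowed, divorced, separated, never married); the label strings are shorthand chosen for this example.

```python
# Probabilities copied from the output of results.predict(new).
probs = [0.750028, 0.126397, 0.001564, 0.033403, 0.021485, 0.067122]

# Assumed mapping: column i corresponds to rmarital code i + 1.
labels = ["married", "cohabitating", "widowed",
          "divorced", "separated", "never married"]

by_status = dict(zip(labels, probs))

# List the statuses from most to least likely.
for status, p in sorted(by_status.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{status:>14}: {p:6.1%}")

# The categories are exhaustive, so the probabilities should sum to about 1.
total = sum(probs)
print(f"total: {total:.6f}")
```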