## Course Assignment Instructions
You should have Python (version 3.8 or later) and Jupyter Notebook installed to complete this assignment. You will write code in the empty cell/cells below the problem. While most of this will be a programming assignment, some questions will ask you to "write a few sentences" in markdown cells. 

Submission Instructions:

1. Create a labs directory in your personal class repository (e.g., located in your home directory).
2. Clone the class repository.
3. Copy this Jupyter notebook file (.ipynb) into your repo/labs directory.
4. Make your edits, commit your changes, and push to your repository.

All submissions must be pushed before the due date to avoid late penalties.

Labs are graded out of 100 points. Each day late costs 10 points, up to a maximum penalty of 50 points after 5 days. After that, you may submit the lab any time before the semester ends for a maximum score of 50.

Lab 6 is due on 3/24/2025

# Visualization in Python

Load the `GSSvocab.csv` dataset into a pandas DataFrame and drop the rows with missing values.
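
The load-and-drop step can be sketched as follows. This is a minimal illustration, not the lab solution: the CSV text is a made-up stand-in with only a few of the columns, and the names `csv_text`, `raw`, and `clean` are assumptions introduced here.

```python
import io
import pandas as pd

# Hypothetical stand-in for GSSvocab.csv: a few made-up rows with gaps.
csv_text = """year,gender,vocab,age
1978,female,6,34
1978,male,,51
1980,female,8,
1980,male,5,27
"""

# For the real file, pd.read_csv("GSSvocab.csv") would take the place of this call.
raw = pd.read_csv(io.StringIO(csv_text))

# dropna() removes every row that has at least one missing value.
clean = raw.dropna().reset_index(drop=True)

print(clean.dtypes)
print(len(raw), "rows before,", len(clean), "rows after")
```

Printing `dtypes` right after loading is a quick way to see which columns pandas read as numbers and which as strings.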

What is the data type of each variable? What do you think is the response variable the collectors of this data had in mind?

There are 8 variables: year, gender, nativeBorn, ageGroup, educGroup, vocab, age, and educ.
year, gender, nativeBorn, ageGroup, and educGroup are categorical variables.
age and educ are continuous variables, and vocab is a numeric score.
I think the response variable is vocab: the collectors wanted to see which features correlate with a higher vocabulary score.

Create two different plots and identify the best-looking plot you can to examine the `age` variable. Save the best looking plot as an appropriately-named PDF.

Using plotnine

In [None]:
from plotnine import ggplot, aes, geom_histogram, geom_density, labs, ggsave

# Plot 1: Histogram of age with 50 bins
hist_plot = (
    ggplot(df, aes(x='age')) +
    geom_histogram(bins=50) +
    labs(x='Age', y='Frequency', title='Histogram of Age')
)

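# Plot 2: Density plot of age with blue fill
density_plot = (
    ggplot(df, aes(x='age')) +
    geom_density(fill='blue') +
    labs(x='Age', y='Density', title='Density of Age')
)

# Display the plots (if using an interactive environment, they will be rendered)
hist_plot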

Using Seaborn and Matplotlib

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
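
As one possible starting point for this cell, the sketch below draws a histogram with Matplotlib alone and saves it as a PDF. The ages are simulated rather than taken from the GSS data, and the file name `age_histogram.pdf` and the temporary output directory are assumptions chosen for illustration.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

# Simulated ages, NOT the GSS data; with the real data this would be df['age'].
rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=500)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(ages, bins=30, edgecolor="white")
ax.set_xlabel("Age")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of Age (simulated)")

# The .pdf extension tells savefig to write a PDF.
pdf_path = os.path.join(tempfile.gettempdir(), "age_histogram.pdf")
fig.savefig(pdf_path)
plt.close(fig)
```

Seaborn's `histplot` draws onto a Matplotlib axes, so the same `savefig` call should apply if you prefer Seaborn for the plot itself.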



We will use plotnine (https://plotnine.org/) as our visualization tool for the first half of this lab. Create two different plots and identify the best-looking plot you can to examine the `vocab` variable. Save the best-looking plot as an appropriately-named PDF.

In [None]:
from plotnine import ggplot, aes, geom_bar, geom_point, labs, theme_minimal, ggsave, position_jitter

# Assume df is your GSSvocab DataFrame
# factor(vocab) makes plotnine treat vocab as a categorical variable

# ---- Plot 1: Bar Plot ----
bar_plot = (ggplot(df, aes(x='factor(vocab)')) +
            geom_bar() +
            labs(x='Vocab Score', y='Count', title='Bar Plot of Vocab') +
            theme_minimal())

# Display the plot (in an interactive environment, this will render)
bar_plot

In [None]:
# Create a jitter plot for 'vocab'


jitter_plot

Create the best-looking plot you can to examine the `ageGroup` variable by `gender`. Does there appear to be an association? There are many ways to do this.

In [None]:
import pandas as pd
from plotnine import ggplot, aes, geom_violin, labs, theme_minimal

# Assume df is your DataFrame containing the columns 'ageGroup' and 'gender'.
# Make sure 'gender' is treated as a categorical variable:


# Create a violin plot by switching the axes:


violin_plot

In [None]:
import pandas as pd
from plotnine import ggplot, aes, geom_jitter, labs, theme_minimal

# Assume df is your DataFrame containing the columns 'ageGroup' and 'gender'
# Ensure 'gender' is treated as a categorical variable
df['gender'] = df['gender'].astype('category')



jitter_plot

Create the best-looking plot you can to examine the `vocab` variable by `age`. Does there appear to be an association?

In [None]:
import pandas as pd
from plotnine import ggplot, aes, geom_jitter, labs, theme_minimal

# Create a jitter plot to examine the relationship between age and vocab
plot = (ggplot(df, aes(x='age', y='vocab')) +
        geom_jitter() +
        labs(x='Age', y='Vocab Score', title='Vocab by Age') +
        theme_minimal())

plot


Add an estimate of $f(x)$ using the smoothing geometry to the previous plot. Does there appear to be an association now? First install pygam by uncommenting and running the cell below and then fill in the missing block in the subsequent cell.

In [None]:
#pip install pygam

In [None]:
import pandas as pd
import numpy as np
from pygam import LinearGAM, s
from plotnine import ggplot, aes, geom_point, geom_line, labs, theme_minimal, scale_y_continuous

# Assume df is your DataFrame with 'age' and 'vocab' columns.
# Ensure 'vocab' is numeric (if it's not already)
df['vocab'] = pd.to_numeric(df['vocab'], errors='coerce')

# Fit a GAM model for vocab ~ s(age)
X = df[['age']].values  
y = df['vocab'].values
gam = LinearGAM(s(0)).fit(X, y)

# Create a grid of age values for prediction
age_grid = np.linspace(df['age'].min(), df['age'].max(), 100)
gam_preds = gam.predict(age_grid.reshape(-1, 1))  # pass a 2-D (n, 1) feature matrix

# Create a DataFrame with the predictions
gam_df = pd.DataFrame({
    'age': age_grid,
    'vocab': gam_preds
})

# Create the plot with y-axis limits between 4.8 and 6.8
plot = ()

plot


Using the plot from the previous question, create the best-looking plot you can by overloading it with the variable `gender`. Does there appear to be an interaction of `gender` and `age`?

In [None]:
from plotnine import ggplot, aes, geom_jitter, geom_smooth, labs, theme_minimal

# Assume df is your DataFrame containing 'age', 'vocab', and 'gender'
# For example, df = pd.read_csv("GSSvocab.csv")


plot

Using the plot from the previous question, create the best-looking plot you can by overloading it with the variable `nativeBorn`. Does there appear to be an interaction of `nativeBorn` and `age`?

In [None]:
from plotnine import ggplot, aes, geom_jitter, geom_smooth, labs, theme_minimal

# Assume df is your GSSvocab DataFrame containing 'age', 'vocab', and 'nativeBorn'

plot


Create two different plots and identify the best-looking plot you can to examine the `vocab` variable by `educGroup`. Does there appear to be an association?

In [None]:
import pandas as pd
from plotnine import ggplot, aes, geom_boxplot, geom_density, labs, theme_minimal

# Assume df is your GSSvocab DataFrame containing the columns 'vocab' and 'educGroup'
# Ensure that 'educGroup' is treated as a categorical variable
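df['educGroup'] = df['educGroup'].astype('category')

# ---- Plot 1: Boxplot of vocab by educGroup ----
boxplot = (ggplot(df, aes(x='educGroup', y='vocab')) +
           geom_boxplot() +
           labs(x='Education Group', y='Vocab Score', title='Vocab by Education Group') +
           theme_minimal())

boxplot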


In [None]:
# ---- Plot 2: Density Plot of vocab with fill by educGroup ----

density_plot

Using the best-looking plot from the previous question, create the best-looking plot you can by overloading it with the variable `gender`. Does there appear to be an interaction of `gender` and `educGroup`?

Using facets, examine the relationship between `vocab` and `ageGroup`. You can drop year level `(Other)`. Are we getting dumber?

In [None]:
import pandas as pd
from plotnine import ggplot, aes, geom_density, facet_grid, labs, theme_minimal

# Assume df is your GSSvocab DataFrame.
# Drop the unwanted level "(Other)" from ageGroup.
df_subset = df[]

# Create the density plot faceted by ageGroup.


plot

# Logistic Regression

Let's consider the Pima Indians Diabetes dataset from 1988:

In [None]:
import statsmodels.api as sm

# Load the Pima.tr2 dataset from the MASS package
pima_dataset = sm.datasets.get_rdataset("Pima.tr2", package="MASS")
pima = pima_dataset.data  # named pima to match the cells below

# Display the first few rows
pima.head()

Note the missing data. We will learn how to handle missing data towards the end of the course. For now, replace the missing data in the design matrix $X$ with the mean of the feature $x_{\cdot, j}$.

In [None]:
# Create the design matrix X with an intercept column 


# Replace missing values in each column with the mean of that column


# Verify that missing values have been replaced
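
The three steps in the cell above can be sketched on a toy table. The numbers below are made up purely for illustration; the column names only mimic a few Pima features, and `toy` and the `const` column name are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Made-up values with a few gaps; NOT the Pima data.
toy = pd.DataFrame({
    "glu": [100.0, 150.0, 120.0, 130.0],
    "bp": [70.0, np.nan, 80.0, 90.0],
    "skin": [np.nan, 30.0, np.nan, 20.0],
})

X = toy.copy()
X.insert(0, "const", 1.0)   # intercept column of ones as the first column

# X.mean() skips NaN by default, so each gap is filled with its own column's mean.
X = X.fillna(X.mean())

print(X)
print("Missing values left:", int(X.isna().sum().sum()))
```

Here the single gap in `bp` becomes 80.0 (the mean of 70, 80, and 90) and the two gaps in `skin` become 25.0.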


Now let's fit a log-odds linear model of y=1 (type is "diabetic") on just the `glu` variable. Import minimize from scipy.optimize to fit the model.

In [None]:
import numpy as np
from scipy.optimize import minimize

y = (pima['type'] == 'Yes').astype(int).values  # type is coded "Yes"/"No"; convert to 1/0
X = pima['glu'].values

# Define the negative log-likelihood function for logistic regression
def neg_loglik(beta):
  
    return 

# Use minimize from SciPy to optimize the negative log-likelihood
result = minimize()
print(result)
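
A minimal sketch of the optimization approach, run on simulated data with known coefficients rather than on the Pima data. The negative log-likelihood is written as $\sum_i \left[\log(1 + e^{\eta_i}) - y_i \eta_i\right]$ with $\eta = X\beta$. The helper name `fit_logistic_regression`, the BFGS method, and the zero starting point are choices made for this illustration, not requirements of the lab.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, y):
    """Negative Bernoulli log-likelihood of a logistic model."""
    eta = X @ beta  # linear predictor (log-odds)
    # logaddexp(0, eta) = log(1 + exp(eta)), computed without overflow
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

def fit_logistic_regression(X, y):
    """Return the coefficient vector b that minimizes neg_loglik."""
    start = np.zeros(X.shape[1])
    result = minimize(neg_loglik, start, args=(X, y), method="BFGS")
    return result.x

# Simulated data (NOT the Pima data) so the true coefficients are known.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])  # intercept column plus one feature
true_beta = np.array([-0.5, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

b = fit_logistic_regression(X, y)
print("Estimated b:", b)
```

With an unscaled predictor such as `glu`, the optimizer may need a better starting point or a rescaled feature to converge cleanly, so it is worth checking `result.success` and `result.message`.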

Extra Credit (+5): Write a `fit_logistic_regression` function that takes X and y, runs the optimization routine above, and returns b.

Run a logistic regression of y=1 (type is "diabetic") on just the `glu` variable using sm from statsmodels.api and report b_0, b_1.

In [None]:
import statsmodels.api as sm

# Add a constant column for the intercept

# Fit the logistic regression model using y as the response and glu as the predictor


# Extract the coefficients: b0 (intercept) and b1 (for glu)
coef = 

print("b0 (Intercept):", )
print("b1 (glu):", )

Comment on how close the results from the statsmodels built-in function were to those from your optimization call.

Interpret the value of b_1 from the statsmodels sm module.

Interpret the value of b_0 from the statsmodels sm module.

Plot the probability of y=1 from the minimum value of `glu` to the maximum value of `glu`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Create a new DataFrame with the glu values sorted from min to max


# Add the constant column so that newdata matches the model's predictors


# Compute the predicted probabilities for y = 1 using the fitted model


# Plot the predicted probability curve 
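
The curve itself is the inverse-logit of the linear predictor evaluated over a grid. The sketch below uses placeholder coefficients and a placeholder range, both assumptions made for illustration; in the lab these would come from the fitted model and from the observed minimum and maximum of `glu`.

```python
import numpy as np
import matplotlib.pyplot as plt

b0, b1 = -5.5, 0.04                    # placeholder coefficients, NOT fitted values
glu_grid = np.linspace(50, 200, 200)   # placeholder range for glu

# Inverse-logit maps the log-odds b0 + b1 * glu to a probability.
prob = 1.0 / (1.0 + np.exp(-(b0 + b1 * glu_grid)))

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(glu_grid, prob)
ax.set_xlabel("glu")
ax.set_ylabel("P(y = 1)")
ax.set_title("Predicted probability across glu (placeholder coefficients)")
ax.set_ylim(0, 1)
```

With a positive slope the curve rises monotonically and stays strictly between 0 and 1.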


Run a logistic regression of y=1 (type is "diabetic") on all variables using statsmodels sm module and report the b vector.

In [None]:
# Grab all the columns except 'type'
X = pima.iloc[:,:8]

# Fit the logistic regression model using y as the response and all variables as predictors
model = sm.Logit(y, X).fit()

# Report the estimated coefficients (b vector)
print("Coefficient vector (b):")
print(model.params)

Predict the probability of diabetes for someone with a blood sugar of 150.

In [None]:
import pandas as pd

# Compute means for the other variables


# Create a new data point, using 150 for glu and the means for the others
glu_150 = pd.DataFrame({
    
    'npreg': [predictor_means['npreg']],

    'bp': [predictor_means['bp']],
    'skin': [predictor_means['skin']],
    'bmi': [predictor_means['bmi']],
    'ped': [predictor_means['ped']],
    'age': [predictor_means['age']]
})

# Predict using the fitted model

print("Predicted probability of diabetes for blood sugar 150:", )

For 100 people with blood sugar of 150, what is the probability more than 75 of them have diabetes? (You may need to review 241 to do this problem).

In [None]:


# Compute the probability that more than 75 of them have diabetes.
# This is: 1 - P(X <= 75)


print("Probability that more than 75 out of 100 people have diabetes:", )
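
If the 100 people are treated as independent and each has the same probability of diabetes, the count with diabetes follows a binomial distribution. The sketch below shows the tail calculation with a hypothetical probability of 0.70; that value is an assumption for illustration and is not the model's prediction.

```python
from math import comb
from scipy.stats import binom

n = 100
p_hat = 0.70   # hypothetical probability, NOT the model's actual prediction

# P(X > 75) = 1 - P(X <= 75); sf is the survival function, 1 - cdf.
prob_more_than_75 = binom.sf(75, n, p_hat)

# Cross-check by summing the binomial pmf over 76, ..., 100 directly.
manual = sum(comb(n, k) * p_hat**k * (1 - p_hat) ** (n - k) for k in range(76, n + 1))

print(prob_more_than_75, manual)
```

Using `sf` rather than `1 - cdf` tends to be more accurate when the tail probability is very small.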

Plot the in-sample log-odds predictions (y-axis) versus the real response values (x-axis).

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Compute the in-sample log-odds predictions (linear predictor)


# Create a DataFrame that holds the real responses and the log-odds predictions
df_plot = pd.DataFrame({
    'Real_Response': y,           # actual binary response values (0 or 1)
    'Predicted_LogOdds': log_odds_predictions
})

# Plot using seaborn's scatterplot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_plot, x=, y=, color=)
plt.xlabel("")
plt.ylabel("")
plt.title("")
plt.grid(True)
plt.show()

Plot the in-sample probability predictions (y-axis) versus the real response values (x-axis).

In [None]:
# Compute predicted probabilities using your fitted logistic regression model.
predicted_probs = model.predict(X)  # X is your original design matrix with the intercept

# Create a DataFrame that holds the actual binary responses and the predicted probabilities.
df_plot = pd.DataFrame({
    'Real_Response': y,                 # Actual response values (0 or 1)
    'Predicted_Probability': predicted_probs
})

# Plot using seaborn's scatterplot
plt.figure(figsize=(8, 6))
sns.scatterplot(data=df_plot, x='Real_Response', y='Predicted_Probability', color='blue')
plt.xlabel()
plt.ylabel()
plt.title()
plt.grid(True)
plt.show()


Comment on how well you think the logistic regression performed in-sample.

Calculate the in-sample Brier score.

In [None]:
import numpy as np

# Compute the in-sample Brier score
# Brier Score = (1/n) * sum( (y - predicted_prob)^2 )
brier_score = np.mean((y - predicted_probs) ** 2)

print("In-sample Brier score:", brier_score)

Calculate the in-sample log-scoring rule.

In [None]:
import numpy as np

# To avoid taking log(0), add a small constant epsilon
epsilon = 1e-9

# Calculate the negative average log-likelihood
# Also known as the log scoring rule
log_score = -np.mean(
    
)

print("In-sample log scoring rule:", log_score)
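
Both scoring rules can be checked by hand on a tiny example. The outcomes and probabilities below are made up for illustration. Note one swap: instead of adding epsilon inside the logarithm, this sketch clips the probabilities to the interval [epsilon, 1 - epsilon], which serves the same purpose of avoiding log(0).

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])             # made-up outcomes
p = np.array([0.9, 0.2, 0.6, 1.0, 0.0])   # made-up predicted probabilities

# Brier score: mean squared difference between outcome and predicted probability.
brier = np.mean((y - p) ** 2)

# Log scoring rule: negative average Bernoulli log-likelihood.
epsilon = 1e-9
p_clipped = np.clip(p, epsilon, 1 - epsilon)   # keeps log() away from 0
log_score = -np.mean(y * np.log(p_clipped) + (1 - y) * np.log(1 - p_clipped))

print("Brier score:", brier)
print("Log score:", log_score)
```

For both rules written this way, lower is better, and a perfect forecast scores 0 on the Brier score.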

Run a probit regression of y=1 (type is "diabetic") on all variables and report the b vector.

In [None]:
# Grab all the columns except 'type'
X = pima.iloc[:,:8]

# Fit the probit regression model using sm.Probit
model = 

# Report the estimated coefficient vector (b)
print("Coefficient vector (b):")
print(model.params)

Do the weight estimates in the probit fit have different signs than the weight estimates in the logistic fit? What does that mean?

Plot the in-sample probability predictions (y-axis) versus the real response values (x-axis).

In [None]:
# Compute predicted probabilities using your fitted probit regression model.
probit_probs = model.predict(X)  # X is your original design matrix with the intercept

# Create a DataFrame that holds the actual binary responses and the predicted probabilities.
df_plot = pd.DataFrame({
    'Real_Response': y,                 # Actual response values (0 or 1)
    'Predicted_Probability': probit_probs
})

# Plot using seaborn's scatterplot


Calculate the in-sample Brier score.

In [None]:
import numpy as np

# Compute the in-sample Brier score
# Brier Score = (1/n) * sum( (y - predicted_prob)^2 )
Probit_brier_score = 

print("In-sample Brier score:", Probit_brier_score)

Calculate the in-sample log-scoring rule.

In [None]:
import numpy as np

# To avoid taking log(0), add a small constant epsilon
epsilon = 1e-9

# Calculate the negative average log-likelihood
# Also known as the log scoring rule
Probit_log_score = -np.mean(
   
)

print("In-sample log scoring rule:", )

Which model did better in-sample?

Compare both models out of sample (oos) using the Brier score and a test set with 1/3 of the data.

In [None]:
from sklearn.model_selection import train_test_split

# Grab all the columns except 'type'
X = pima.iloc[:,:8]
y = (pima['type'] == 'Yes').astype(int).values  # type is coded "Yes"/"No"; convert to 1/0

# Split the data: 2/3 training, 1/3 test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42
)

# Fit the logistic regression (logit) model on the training set


# Fit the probit model on the training set


# Predict probabilities on the test set for both models


# Calculate the out-of-sample Brier score for each model
# Brier Score = mean( (actual - predicted)^2 )


print("Out-of-sample Brier Score (Logit):", brier_logit)
print("Out-of-sample Brier Score (Probit):", brier_probit)

Which model did better oos?