In [None]:
import os
import pandas as pd
import geopandas as gpd

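# new import statements
import sklearn.metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score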

# ML overview

#### Covid deaths analysis

- Source: https://data.dhsgis.wi.gov/
    - Specifically, let's analyze "COVID-19 Data by Census Tract V2": https://data.dhsgis.wi.gov/datasets/wi-dhs::covid-19-data-by-census-tract-v2/explore
        - Status Flag Values: -999: Census tracts, municipalities, school districts, and zip codes with 0–4 aggregate counts for any data have been suppressed. County data with 0-4 aggregate counts by demographic factors (e.g., by age group, race, ethnicity) have been suppressed.

In [None]:
# Do not repeatedly download large datasets
# Save a local copy instead
dataset_file = "covid.geojson"
if os.path.exists(dataset_file):
    print("Reading local file.")
    df = gpd.read_file(dataset_file)
else:
    print("Downloading the dataset.")
    url = "https://dhsgis.wi.gov/server/rest/services/DHS_COVID19/COVID19_WI_V2/MapServer/9/query?outFields=*&where=1%3D1&f=geojson"
    # Read the geojson data
    df = gpd.read_file(url)
    # Save it to a local file (dataset_file)
    df.to_file(dataset_file, driver="GeoJSON")

In [None]:
df.head()

In [None]:
# Explore the columns
df.columns

In [None]:
# Create a geographic plot
df.plot()

### Predicting "DTH_CUM_CP"

### How can we get a clean dataset of COVID deaths in WI?

In [None]:
# Replace -999 with 2; 2 is between 0-4; random choice instead of using 0
df = df.replace(-999, 2)
# we must communicate in final results what percent of values were guessed (imputed)

How would we know if the data is now clean?
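
One way to answer this, sketched on a small made-up frame rather than the real dataset (all values below are invented for illustration): count the `-999` sentinels before replacing them, so the imputed share can be reported, then confirm that none remain afterwards.

```python
import pandas as pd

# Made-up example values; -999 marks a suppressed count (0-4)
toy = pd.DataFrame({"DTH_CUM_CP": [12, -999, 7, -999, 30],
                    "POP": [4000, 1500, 3200, 900, 5100]})

# Share of cells that will be imputed (report this with the final results)
imputed_pct = (toy == -999).sum().sum() / toy.size * 100

# Replace the sentinel with 2, the same guess used for the real data
clean = toy.replace(-999, 2)

# After cleaning, no sentinel values should remain
remaining = int((clean == -999).sum().sum())
print(f"{imputed_pct:.1f}% imputed, {remaining} sentinels remaining")
```

A sentinel check only catches one kind of problem; plotting the data is still needed to spot rows that are invalid for other reasons.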

In [None]:
# Create a scatter plot to visualize relationship between "POP" and "DTH_CUM_CP"
df.plot.scatter(x="POP", y="DTH_CUM_CP")

Which points are concerning? Let's take a closer look.

#### Which rows have "DTH_CUM_CP" greater than 300?

In [None]:
df[df["DTH_CUM_CP"] > 300]

#### Valid rows have "GEOID" that only contains digits

We can filter using `str` methods: `str.fullmatch` checks whether the entire string matches a regular expression. Because the whole string must match, the anchor characters (`^`, `$`) are not needed.
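
A small sketch of the difference between a full match and a partial match, using made-up GEOID strings (not taken from the dataset):

```python
import pandas as pd

# Made-up GEOID values: the first and last contain only digits
geoids = pd.Series(["55025000100", "TRACT N/A", "5507900120A", "55079001200"])

full = geoids.str.fullmatch(r"\d+")        # whole string must be digits
partial = geoids.str.contains(r"\d+")      # a run of digits anywhere is enough
anchored = geoids.str.contains(r"^\d+$")   # explicit anchors, same result as fullmatch here

print(pd.DataFrame({"geoid": geoids, "full": full,
                    "partial": partial, "anchored": anchored}))
```

The third value shows why `str.contains` alone would be too permissive for this filter: it has digits, but it is not digits only.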

In [None]:
df["GEOID"]

In [None]:
df = df[df["GEOID"].str.fullmatch(r"\d+")]
df.plot.scatter(x="POP", y="DTH_CUM_CP")

### How can we train/fit models to known data to predict unknowns?
- Feature(s) => Predictions
    - Population => Deaths
    - Cases => Deaths
    - Cases by Age => Deaths
    
- General structure for fitting models:
    ```python
    model = <some model>
    model.fit(X, y)
    y = model.predict(X)
    ```
    - where `X` must be a matrix or a `DataFrame` and `y` must be an array (vector) or a `Series`
    - after fitting, the `model` object stores the learned relationship between the features (x values) and the labels (y values)
    - `predict` returns a `numpy` array, which can be treated like a list
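
The fit/predict pattern can be sketched end to end on synthetic data. The numbers below are invented so that `y = 3x + 1` holds exactly; they are not from the COVID dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 3x + 1 exactly (invented for illustration)
toy = pd.DataFrame({"x": [0, 1, 2, 3, 4]})
toy["y"] = 3 * toy["x"] + 1

X = toy[["x"]]   # list of columns -> DataFrame (2-D)
y = toy["y"]     # single column name -> Series (1-D)

model = LinearRegression()
model.fit(X, y)

# Predict for x values the model has not seen
predictions = model.predict(pd.DataFrame({"x": [10, 20]}))
print(type(predictions), predictions)
```

Note that the frame passed to `predict` uses the same column name as the frame passed to `fit`.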

### Predicting "DTH_CUM_CP" using "POP" as feature.

In [None]:
# We must specify a list of columns to make sure we extract a DataFrame and not a Series
# Feature DataFrame
df[["POP"]]

In [None]:
# Label Series: "DTH_CUM_CP"
df["DTH_CUM_CP"]

### Let's use `LinearRegression` model.

- `from sklearn.linear_model import LinearRegression`

In [None]:
xcols = ["POP"]
ycol = "DTH_CUM_CP"

model = LinearRegression()
model.fit(df[xcols], df[ycol])
# less interesting because we are predicting what we already know
y = model.predict(df[xcols])

Predicting for new values of x.

In [None]:
predict_df = pd.DataFrame({"POP": [1000, 2000, 3000]})
predict_df

In [None]:
# Predict for the new data
model.predict(predict_df)

In [None]:
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df[["POP"]])
predict_df

### How can we visualize model predictions?

- Let's predict deaths for "POP" ranges like 0, 1000, 2000, ..., 20000

In [None]:
predict_df = pd.DataFrame({"POP": range(0, 20001, 1000)})
predict_df

In [None]:
# Insert a new column called "predicted deaths" with the predictions
predict_df["predicted deaths"] = model.predict(predict_df[["POP"]])
predict_df

In [None]:
# Create a line plot to visualize relationship between "POP" and "predicted deaths"
ax = predict_df.plot.line(x="POP", y="predicted deaths", color="r")
# Create a scatter plot to visualize relationship between "POP" and "DTH_CUM_CP"
df.plot.scatter(x="POP", y="DTH_CUM_CP", ax=ax, color="k", alpha=0.05)

### How can we get a formula for the relationship?

- `y=mx+c`, where `y` is the prediction and `x` is the feature used for the fit
    - Slope of the line (`m`) given by `model.coef_[0]`
    - Intercept of the line (`c`) given by `model.intercept_`
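
As a sanity check, the formula can be evaluated by hand and compared with `predict`. This is a sketch on synthetic data (a noisy line with slope 0.5 and intercept 2, invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic noisy line: y = 0.5x + 2 plus small random noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": np.arange(50, dtype=float)})
toy["y"] = 0.5 * toy["x"] + 2 + rng.normal(scale=0.1, size=len(toy))

model = LinearRegression().fit(toy[["x"]], toy["y"])

m = model.coef_[0]      # slope: one coefficient per feature column
c = model.intercept_    # intercept: a single number

manual = m * toy["x"] + c                # y = mx + c, computed by hand
from_model = model.predict(toy[["x"]])   # the same values via predict
print(np.allclose(manual, from_model))
```

`coef_` is an array because a model can have several features; with one feature, the slope is its first (and only) element.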

Model coefficients

In [None]:
model

In [None]:
# Slope of the line
model.coef_[0]

In [None]:
# Intercept of the line
model.intercept_

In [None]:
print(f"deaths ~= {round(model.coef_[0], 4)} * population + {round(model.intercept_, 4)}")

### How well does our model fit the data?
- explained variance score
- R^2 ("r squared")

#### `sklearn.metrics.explained_variance_score(y_true, y_pred)`
- requires `import sklearn.metrics`
- calculates the explained variance score given:
    - y_true: actual death values in our example
    - y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.explained_variance_score.html

In [None]:
xcols, ycol

In [None]:
# Let's now make predictions for the known data
predictions = model.predict(df[xcols])
predictions

In [None]:
sklearn.metrics.explained_variance_score(df[ycol], predictions)

#### Explained variance score

- `explained_variance_score = (known_var - unexplained_var) / known_var`
    - where `known_var = y_true.var()` and `unexplained_var = (y_true - y_pred).var()`

What is the variation in known deaths?

In [None]:
# Compute variance of "DTH_CUM_CP" column
known_var = df[ycol].var()
known_var

In [None]:
# unexplained variance: the variance of the errors (actual - predicted)
unexplained_var = (df[ycol] - predictions).var()
unexplained_var

In [None]:
# explained variance score
explained_variance_score = (known_var - unexplained_var) / known_var
explained_variance_score

In [None]:
# For comparison here is the explained variance score from sklearn
sklearn.metrics.explained_variance_score(df[ycol], predictions)

#### `sklearn.metrics.r2_score(y_true, y_pred)`

- requires `import sklearn.metrics`
- calculates the R^2 score (coefficient of determination) given:
    - y_true: actual death values in our example
    - y_pred: prediction of deaths in our example
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html 

In [None]:
sklearn.metrics.r2_score(df[ycol], predictions)

#### R^2 score (aka coefficient of determination) approximation

- `r2_score = (known_var - r2_val) / known_var`
    - where `known_var = y_true.var()` and `r2_val = ((y_true - y_pred) ** 2).mean()`
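
The two scores differ only in how they treat the errors: explained variance uses the variance of the errors (which ignores a constant offset), while R^2 uses the mean squared error. A minimal sketch with invented numbers, where every prediction is too high by exactly 5:

```python
import numpy as np

# Invented values: the predictions are off by a constant amount
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = y_true + 5.0

residuals = y_true - y_pred
known_var = y_true.var()   # NumPy default divides by n

# Variance of the errors is 0 because every error is the same
explained_variance_score = (known_var - residuals.var()) / known_var
# Mean squared error is 25, so R^2 is penalized for the offset
r2_score = (known_var - (residuals ** 2).mean()) / known_var

print(explained_variance_score, r2_score)
```

When the errors average out to zero, as they do for a linear regression with an intercept evaluated on the data it was fit to, the two scores coincide.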

In [None]:
# r2_val
r2_val = ((df[ycol] - predictions) ** 2).mean()
r2_val

In [None]:
r2_score = (known_var - r2_val) / known_var
r2_score # close to sklearn's value; pandas .var() divides by n - 1, while .mean() divides by n

#### `model.score(X, y)`
- invokes the `predict` method on the features (`X`) and compares the predictions with the true values (`y`); for regression models, the result is the R^2 score
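
A quick sketch, on synthetic data invented for illustration, showing that `model.score` gives the same number as calling `predict` and then `r2_score` by hand:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a noisy line (values are made up)
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.uniform(0, 100, size=200)})
toy["y"] = 2 * toy["x"] + rng.normal(scale=20, size=len(toy))

model = LinearRegression().fit(toy[["x"]], toy["y"])

via_score = model.score(toy[["x"]], toy["y"])                # one step
via_metric = r2_score(toy["y"], model.predict(toy[["x"]]))   # two steps
print(np.isclose(via_score, via_metric))
```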

In [None]:
model.score(df[xcols], df[ycol])

#### Did our model learn, or just memorize (that is, "overfit")?

- Split data into train and test

In [None]:
# Split the data into two equal parts
len(df) // 2

In [None]:
# Manual way of splitting train and test data
train, test = df.iloc[:len(df)//2], df.iloc[len(df)//2:]
len(train), len(test)

The problem with manual splitting is that it only works if the rows are not sorted in some meaningful way; otherwise the train and test halves contain systematically different rows.

#### `train_test_split(<dataframe>, test_size=<val>)`

- requires `from sklearn.model_selection import train_test_split`
- shuffles the data and then splits it into train and test (75%-25% by default)
    - produces a different train/test split every time it runs
- `test_size` parameter can take two kinds of values:
    - an integer: the actual number of rows that we want in the test data
    - a fraction between 0 and 1: the proportion of rows to put in the test data
    - default value is `0.25`
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
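
The sketch below tries each form of `test_size` on a made-up 100-row frame. It also passes `random_state`, a `train_test_split` parameter not used elsewhere in these notes, which fixes the shuffle so that a split can be repeated.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 100 rows
toy = pd.DataFrame({"x": range(100)})

# Default: 25% of the rows go to the test set
train, test = train_test_split(toy, random_state=0)
default_sizes = (len(train), len(test))

# Integer test_size: exact number of test rows
train, test = train_test_split(toy, test_size=10, random_state=0)
int_sizes = (len(train), len(test))

# Float test_size: fraction of rows placed in the test set
train, test = train_test_split(toy, test_size=0.5, random_state=0)
float_sizes = (len(train), len(test))

# Same random_state -> same shuffle, so the split is repeatable
a, _ = train_test_split(toy, random_state=42)
b, _ = train_test_split(toy, random_state=42)
print(default_sizes, int_sizes, float_sizes, a.index.equals(b.index))
```

Leaving `random_state` out keeps the behavior described above: a new split on every run.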

In [None]:
train, test = train_test_split(df)
len(train), len(test)

In [None]:
# Test size using row count
train, test = train_test_split(df, test_size=120)
len(train), len(test)

In [None]:
# Test size using fraction
train, test = train_test_split(df, test_size=0.5)
len(train), len(test)

In [None]:
# Running this cell twice will give you two different train datasets
train, test = train_test_split(df)
train.head()

In [None]:
train, test = train_test_split(df)

# Let's use the train and the test data
model = LinearRegression()
# Fit using training data
model.fit(train[xcols], train[ycol])
# Predict using test data
y = model.predict(test[xcols])
# We can use score directly as it automatically invokes predict
model.score(test[xcols], test[ycol])

Running the above cell again will give you a different model and score, because the train/test split is random.
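
To see how much the score can move, the following sketch repeats the split, fit, and score steps five times on a small synthetic dataset (noisy on purpose, values invented). The `random_state` argument is used here only to make each of the five splits different but repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small, noisy synthetic dataset (made up for illustration)
rng = np.random.default_rng(2)
toy = pd.DataFrame({"x": rng.uniform(0, 100, size=60)})
toy["y"] = 2 * toy["x"] + rng.normal(scale=40, size=len(toy))

scores = []
for seed in range(5):
    # A different shuffle for each seed
    train, test = train_test_split(toy, random_state=seed)
    model = LinearRegression().fit(train[["x"]], train["y"])
    scores.append(model.score(test[["x"]], test["y"]))

print(np.round(scores, 3))
```

The spread across the five numbers is the noise that a single train/test split hides.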

#### How can we minimize noise due to random train/test splits?

### Cross validation: `cross_val_score(estimator, X, y)`

- requires `from sklearn.model_selection import cross_val_score`
- performs several different train/test splits of the data, fitting and scoring the model on each split
- documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
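
A minimal sketch on synthetic data (values invented). It passes `cv=5` explicitly, which asks for five folds: the data is cut into five parts and each part serves as the test set once.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic noisy line (made up for illustration)
rng = np.random.default_rng(3)
toy = pd.DataFrame({"x": rng.uniform(0, 100, size=100)})
toy["y"] = 2 * toy["x"] + rng.normal(scale=20, size=len(toy))

# One score per fold; the model is refit from scratch for each fold
scores = cross_val_score(LinearRegression(), toy[["x"]], toy["y"], cv=5)
print(scores.shape, scores.mean(), scores.std())
```

The result is a `numpy` array, so `mean` and `std` are available directly for summarizing the folds.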

In [None]:
train, test = train_test_split(df)

model = LinearRegression()
scores = cross_val_score(model, train[xcols], train[ycol])
scores

In [None]:
# Compute mean of the scores
scores.mean()

#### How can we compare models?
- model 1: POP => DEATHS
- model 2: CASES (POS_CUM_CP) => DEATHS

In [None]:
model1 = LinearRegression()
model2 = LinearRegression()
model1_scores = cross_val_score(model1, train[["POP"]], train[ycol])
model2_scores = cross_val_score(model2, train[["POS_CUM_CP"]], train[ycol])

In [None]:
model1_scores.mean()

In [None]:
model2_scores.mean()

Which of these two models do you think will perform better? Probably model2.

In [None]:
means = pd.Series({"model1": model1_scores.mean(),
                   "model2": model2_scores.mean()})
means.plot.bar(figsize=(3, 3))

How do we know the above difference is not noise? Let's calculate standard deviation and display error bars on the bar plot.
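
As a rough numeric version of the same idea, the gap between the two means can be compared with the fold-to-fold spread. The scores below are made up for two hypothetical models, and the comparison is a rule of thumb rather than a formal significance test.

```python
import numpy as np

# Made-up cross-validation scores for two hypothetical models
model1_scores = np.array([0.40, 0.45, 0.38, 0.42, 0.41])
model2_scores = np.array([0.78, 0.81, 0.75, 0.80, 0.79])

# Difference between the average scores
gap = model2_scores.mean() - model1_scores.mean()
# Larger of the two standard deviations across folds
spread = max(model1_scores.std(), model2_scores.std())

# A gap many times larger than the spread is unlikely to be noise
print(f"gap = {gap:.3f}, largest std = {spread:.3f}, ratio = {gap / spread:.1f}")
```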

In [None]:
model1_scores.std()

In [None]:
model2_scores.std()

In [None]:
err = pd.Series({"model1": model1_scores.std(),
                 "model2": model2_scores.std()})
err

In [None]:
# Plot error bars by passing an argument to parameter yerr
means.plot.bar(figsize=(3, 3), yerr=err)

Pick a winner and run it one more time against test data.

#### How can we use multiple x variables (multiple regression)?

In [None]:
model = LinearRegression()
xcols = ['POS_0_9_CP', 'POS_10_19_CP', 'POS_20_29_CP', 'POS_30_39_CP',
       'POS_40_49_CP', 'POS_50_59_CP', 'POS_60_69_CP', 'POS_70_79_CP',
       'POS_80_89_CP', 'POS_90_CP']
ycol = "DTH_CUM_CP"

model.fit(train[xcols], train[ycol])
model.score(test[xcols], test[ycol]) 

#### How can we interpret what features the model is relying on?
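
A sketch of the idea on synthetic data: the column names and the relationship (`deaths` depends weakly on `young_cases` and strongly on `old_cases`) are invented for illustration, so the fitted coefficients should recover roughly 0.01 and 0.2.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data: deaths depend mostly on old_cases
rng = np.random.default_rng(4)
toy = pd.DataFrame({"young_cases": rng.uniform(0, 100, size=300),
                    "old_cases": rng.uniform(0, 100, size=300)})
toy["deaths"] = (0.01 * toy["young_cases"] + 0.2 * toy["old_cases"]
                 + rng.normal(scale=1, size=len(toy)))

xcols = ["young_cases", "old_cases"]
model = LinearRegression().fit(toy[xcols], toy["deaths"])

# Label each coefficient with the column it belongs to
coefs = pd.Series(model.coef_, index=xcols)
print(coefs)
```

Coefficients are in the same order as the feature columns, so attaching the column names makes them much easier to read. They are directly comparable here only because both features are on the same scale.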

In [None]:
model.coef_

In [None]:
pd.Series(model.coef_, index=xcols).plot.bar(figsize=(3, 2))