# Python Machine Learning 

------
## Tools to Learn / Use

* Load and Clean Data  
  Use Pandas unless you have a reason not to (for example, the data is too large to fit in memory)
* Explore Data  
  Pandas, numpy, matplotlib, seaborn
* Models  
  SciPy, scikit-learn  
  https://www.w3schools.com/python/python_ml_getting_started.asp  
  https://www.kaggle.com/learn/intro-to-machine-learning  
  Deep Learning: PyTorch, Keras, TensorFlow  
  https://www.datacamp.com/tutorial/pytorch-vs-tensorflow-vs-keras
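
When a file really is too large to load at once, one option that stays within pandas is to read it in chunks. The following is only a minimal sketch: the inline sample data, the `Precip` column, and the chunk size are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large CSV file on disk
csv_text = "Date,Precip\n6/1/19,0.00\n6/2/19,0.25\n6/3/19,0.00\n6/4/19,1.10\n6/5/19,0.05\n"

rainy_days = 0
total_rows = 0

# chunksize makes read_csv return an iterator of DataFrames instead of one DataFrame
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_rows += len(chunk)
    rainy_days += int((chunk["Precip"] > 0).sum())

print(f"rows: {total_rows}, rainy days: {rainy_days}")
```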
  
------
## Good Reference Site

* https://machinelearningmastery.com/start-here/
* https://machinelearningmastery.com/start-here/#algorithms
* https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
* https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

------
## Python Packages


In [None]:
# Python version
import sys
print(f'Python: {sys.version}')

# pandas
import pandas as pd
print(f'pandas: {pd.__version__}')

# numpy
import numpy as np
print(f'numpy: {np.__version__}')

# matplotlib
import matplotlib
import matplotlib.pyplot as plt
print(f'matplotlib: {matplotlib.__version__}')

# seaborn
import seaborn as sns
print(f'seaborn: {sns.__version__}')

# scikit-learn
import sklearn
print(f'sklearn: {sklearn.__version__}')

-----
## Problem

We will use weather data from [MSU's North Farm](http://deltaweather.extension.msstate.edu/msu-north-farm-starkv) (next to the research park) to see whether we can predict if it will rain tomorrow.

We will use daily information from 6/1/2019 to 9/1/2024.

| Attribute   | Definition    |
|:------------|:--------------|
|Date|                Record Date MM/DD/YYYY  |
|Julian|              Record Julian Date NNN day of the year  |
|AirTempMax|          Air Temperature Max (Degrees Fahrenheit F)  |
|AirTempMin|          Air Temperature Min (Degrees Fahrenheit F)  |
|AirTempObsv|         Air Temperature Observed (Degrees Fahrenheit F) This is an instantaneous reading for air temperature at 7:00:00 standard time.  |
|HumidMax|            Relative Humidity Max (Percent)  |
|HumidMin|            Relative Humidity Min (Percent)  |
|HumidObsv|           Relative Humidity Observed (Percent) This is an instantaneous reading at 7:00:00 standard time.  |
|Precip|              Precipitation (Inches n.nn) total rainfall that occurred for the day.  |
|WindRun|             Wind Run (Miles) Wind travel for the day.  |
|AvgWindSpeed|        Wind Speed (Miles Per Hour) Resultant / Average speed for the day.  |
|WindDirection|       Wind Direction (Degrees) Resultant Direction for the day  |
|SolarRadiation|      Solar Radiation (Langleys) that occurred for the day.  |
|SoilTempMax|         Soil Temperature Max at 2 inch depth (Degrees Fahrenheit F) that occurred during the day.  |
|SoilTempMin|         Soil Temperature Min at 4 inch depth (Degrees Fahrenheit F) that occurred during the day.  |
|SoilTempObsv|        Soil Temperature Observed at 4 inch depth (Degrees Fahrenheit F) This is an instantaneous reading 4 inch soil temperature at 7:00:00 standard time.  |
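
The `Julian` attribute is described above as the day of the year. As a rough illustration of how that value relates to `Date`, here is a small sketch using only the standard library. The helper name `day_of_year` is hypothetical, and it assumes dates written as MM/DD/YYYY.

```python
from datetime import datetime

def day_of_year(date_text: str) -> int:
    """Return the day of the year (1-366) for a MM/DD/YYYY date string."""
    return datetime.strptime(date_text, "%m/%d/%Y").timetuple().tm_yday

first_day = day_of_year("06/01/2019")
last_day = day_of_year("09/01/2024")   # 2024 is a leap year
print(first_day, last_day)
```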


-----
## Load Data



In [None]:
RawData = pd.read_csv('https://raw.githubusercontent.com/jcrumpton/DSCI_6113_Data/refs/heads/main/DS_Club/MSU_North_Farm.csv')
display(RawData.info())
RawData.head()

In [None]:
plt.figure(figsize=(15, 15))
sns.heatmap(RawData.isnull(), cbar=False);


In [None]:
Data = RawData.dropna().copy() 
Data.info()


In [None]:
# Convert Types

Data['Date'] = pd.to_datetime(Data['Date'], format="%m/%d/%y", errors='coerce') 
Data['AirTempMin'] = Data['AirTempMin'].astype(int)  
Data['AirTempObsv'] = Data['AirTempObsv'].astype(int)  
Data['HumidMin'] = Data['HumidMin'].astype(int)  
Data['HumidObsv'] = Data['HumidObsv'].astype(int)  
Data['SoilTempObsv'] = Data['SoilTempObsv'].astype(int)

Data.info()

In [None]:
Data.head()


In [None]:
Data.describe()

-----
## Add Columns


In [None]:
# Did it rain on a given day?

Data['RainYN'] = np.where(Data['Precip'] > 0, 1, 0)
Data.head()


In [None]:
# Shift RainYN back one day to use as RainTom (rain tomorrow)
Data['RainTom'] = Data['RainYN'].shift(-1)
Data.tail()

In [None]:
Data = Data.dropna().copy()  # copy so the assignment below works on an independent frame
Data['RainTom'] = Data['RainTom'].astype(int)
Data.tail(5)
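
One thing to watch with this approach: `shift(-1)` pairs each row with the next row, not necessarily the next calendar day. Because rows with missing values were dropped earlier, some neighboring rows may be more than one day apart. The sketch below uses made-up data (the `NextDate` and `GapDays` column names are hypothetical) to show one way such gaps could be detected and filtered out.

```python
import pandas as pd

# Made-up sample: 6/3 is missing, as if it had been removed by dropna()
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2019-06-01", "2019-06-02", "2019-06-04", "2019-06-05"]),
    "RainYN": [0, 1, 0, 1],
})

demo["RainTom"] = demo["RainYN"].shift(-1)      # label taken from the next row
demo["NextDate"] = demo["Date"].shift(-1)
demo["GapDays"] = (demo["NextDate"] - demo["Date"]).dt.days

# Keep only rows whose "tomorrow" really is the next calendar day
consecutive = demo[demo["GapDays"] == 1].copy()
consecutive["RainTom"] = consecutive["RainTom"].astype(int)
print(consecutive[["Date", "RainYN", "RainTom"]])
```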

-----
## Explore Data

In [None]:
# distributions
print(Data.groupby('RainTom').size())


In [None]:
# plt.hist(Data['RainTom'])
sns.countplot(x ='RainTom', data = Data)
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(Data['Date'], Data['Precip'])


In [None]:
# Extract Month from Date
Data['Month'] = pd.to_datetime(Data['Date']).dt.month

# Move Month to next to Date
column_to_move = Data.pop('Month')
Data.insert(1, 'Month', column_to_move)

Data.head()

In [None]:
plt.figure(figsize=(10, 8))
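# numeric_only=True skips the datetime Date column, which recent pandas versions will not convert for corr()
sns.heatmap(Data.corr(numeric_only=True));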


-----
## Machine Learning


In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

def make_confusion_matrix(cf,
                          group_names=None,
                          categories='auto',
                          count=True,
                          percent=True,
                          cbar=True,
                          xyticks=True,
                          xyplotlabels=True,
                          sum_stats=True,
                          figsize=None,
                          cmap='Blues',
                          title=None):
    '''
    This function will make a pretty plot of an sklearn Confusion Matrix cm using a Seaborn heatmap visualization.
    https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
    https://github.com/DTrimarchi10/confusion_matrix

    Arguments
    ---------
    cf:            confusion matrix to be passed in

    group_names:   List of strings that represent the labels row by row to be shown in each square.

    categories:    List of strings containing the categories to be displayed on the x,y axis. Default is 'auto'

    count:         If True, show the raw number in the confusion matrix. Default is True.

    percent:       If True, show the proportions for each category. Default is True.

    cbar:          If True, show the color bar. The cbar values are based off the values in the confusion matrix.
                   Default is True.

    xyticks:       If True, show x and y ticks. Default is True.

    xyplotlabels:  If True, show 'True Label' and 'Predicted Label' on the figure. Default is True.

    sum_stats:     If True, display summary statistics below the figure. Default is True.

    figsize:       Tuple representing the figure size. Default will be the matplotlib rcParams value.

    cmap:          Colormap of the values displayed from matplotlib.pyplot.cm. Default is 'Blues'
                   See http://matplotlib.org/examples/color/colormaps_reference.html
                   
    title:         Title for the heatmap. Default is None.

    '''


    # CODE TO GENERATE TEXT INSIDE EACH SQUARE
    blanks = ['' for i in range(cf.size)]

    if group_names and len(group_names)==cf.size:
        group_labels = ["{}\n".format(value) for value in group_names]
    else:
        group_labels = blanks

    if count:
        group_counts = ["{0:0.0f}\n".format(value) for value in cf.flatten()]
    else:
        group_counts = blanks

    if percent:
        group_percentages = ["{0:.2%}".format(value) for value in cf.flatten()/np.sum(cf)]
    else:
        group_percentages = blanks

    box_labels = [f"{v1}{v2}{v3}".strip() for v1, v2, v3 in zip(group_labels,group_counts,group_percentages)]
    box_labels = np.asarray(box_labels).reshape(cf.shape[0],cf.shape[1])


    # CODE TO GENERATE SUMMARY STATISTICS & TEXT FOR SUMMARY STATS
    if sum_stats:
        #Accuracy is sum of diagonal divided by total observations
        accuracy  = np.trace(cf) / float(np.sum(cf))

        #if it is a binary confusion matrix, show some more stats
        if len(cf)==2:
            #Metrics for Binary Confusion Matrices
            precision = cf[1,1] / sum(cf[:,1])
            recall    = cf[1,1] / sum(cf[1,:])
            f1_score  = 2*precision*recall / (precision + recall)
            stats_text = "\n\nAccuracy={:0.3f}\nPrecision={:0.3f}\nRecall={:0.3f}\nF1 Score={:0.3f}".format(
                accuracy,precision,recall,f1_score)
        else:
            stats_text = "\n\nAccuracy={:0.3f}".format(accuracy)
    else:
        stats_text = ""


    # SET FIGURE PARAMETERS ACCORDING TO OTHER ARGUMENTS
    if figsize is None:
        #Get default figure size if not set
        figsize = plt.rcParams.get('figure.figsize')

    if xyticks==False:
        #Do not show categories if xyticks is False
        categories=False


    # MAKE THE HEATMAP VISUALIZATION
    plt.figure(figsize=figsize)
    sns.heatmap(cf,annot=box_labels,fmt="",cmap=cmap,cbar=cbar,xticklabels=categories,yticklabels=categories)

    if xyplotlabels:
        plt.ylabel('True label')
        plt.xlabel('Predicted label' + stats_text)
    else:
        plt.xlabel(stats_text)
    
    if title:
        plt.title(title)

In [None]:
# Create training & testing sets
from sklearn.model_selection import train_test_split
y = Data["RainTom"]
X = Data.drop(columns=["Date","RainTom"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print (f'X_train: {X_train.shape} \nX_test: {X_test.shape} \ny_train: {y_train.shape} \ny_test: {y_test.shape}')


In [None]:
# What is our baseline? 
#
# For this problem, what would happen if we always answered no?
from sklearn.metrics import accuracy_score

y_pred_baseline = np.zeros(len(y_test))
print(f"baseline model accuracy = {accuracy_score(y_test, y_pred_baseline)}")
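
scikit-learn also ships a `DummyClassifier` that expresses the same "always predict the majority class" baseline as an estimator, which can be convenient when comparing models side by side. A minimal sketch on made-up labels (the arrays here are hypothetical, not the weather data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 7 dry days (0) and 3 rainy days (1)
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# "most_frequent" always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(f"baseline accuracy = {baseline_acc}")
```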


In [None]:
from sklearn.linear_model import LogisticRegression

# Set regularization rate
reg = 0.01


# train a logistic regression model on the training set
lr_clf = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

y_pred_lr = lr_clf.predict(X_test)
print(f"LR model accuracy = {accuracy_score(y_test, y_pred_lr)}")

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier().fit(X_train, y_train)
y_pred_dt = dt_clf.predict(X_test)

print(f"DT model accuracy = {accuracy_score(y_test, y_pred_dt)}")


In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier().fit(X_train, y_train)
y_pred_rf = rf_clf.predict(X_test)


print(f"RF model accuracy = {accuracy_score(y_test, y_pred_rf)}")

In [None]:
# Create a series containing features' importance from the model and feature names from the training data
feature_importances = pd.Series(rf_clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

# Plot a simple bar chart
feature_importances.plot.bar()

In [None]:
# Tune hyperparameters - possible on all previous models

if False:
    from sklearn.model_selection import RandomizedSearchCV

    param_dist = {'n_estimators': range(50, 551, 50),
                  'max_depth': range(1,10)}

    # Create a random forest classifier
    rf = RandomForestClassifier()

    # Use random search to find the best hyperparameters
    rand_search = RandomizedSearchCV(rf, 
                                     param_distributions=param_dist, 
                                     n_iter=10,
                                     cv=5,
                                     random_state=1)

    # Fit the random search object to the data
    rand_search.fit(X_train, y_train)

    # Create a variable for the best model
    best_rf = rand_search.best_estimator_

    # Print the best hyperparameters
    print('Best hyperparameters:',  rand_search.best_params_)

    predictions = best_rf.predict(X_test)
    accuracyScore = accuracy_score(y_test, predictions)
    print(f"Validation Accuracy with Tuned Random Forest: {accuracyScore}")
    
# Best hyperparameters: {'n_estimators': 50, 'max_depth': 9}
# Validation Accuracy with Tuned Random Forest: 0.6901041666666666


In [None]:
from sklearn.metrics import confusion_matrix

labels = ['True Neg','False Pos','False Neg','True Pos']
cf_matrix = confusion_matrix(y_test, y_pred_rf)
make_confusion_matrix(cf_matrix, group_names=labels)

__Accuracy__: the ratio of correct predictions, true positives (TP) plus true negatives (TN), to the total number of samples

__Precision__: the ratio of TP to the sum of TP and false positives (FP)

__Recall__: the ratio of TP to the sum of TP and false negatives (FN)

__F1 Score__: the harmonic mean of precision and recall
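
The four definitions above can be checked by hand. The counts below are made up purely for illustration; they do not come from the weather models.

```python
# Made-up counts from a hypothetical binary confusion matrix
tn, fp, fn, tp = 50, 10, 5, 35

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```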

__Good and Bad Scores__ (rules of thumb; what counts as good depends on the problem and on the baseline)
* Accuracy above 0.90 (90%) is often considered excellent because it means the model correctly predicts 9 out of 10 instances. With imbalanced classes, however, always predicting the majority class can score just as high, so compare accuracy against the baseline.

* Precision, Recall, and F1 score above 0.80 (80%) generally indicate strong performance. In a cancer detection model, Precision of 0.80 means 80% of positive predictions are correct, while Recall of 0.80 means 80% of actual cancer cases are identified. An F1 score of 0.80 signifies a balanced trade-off between Precision and Recall, which matters when both false positives and false negatives have significant consequences.

* Accuracy below 0.50 (50%) on a binary problem is typically poor because it is no better than random guessing on balanced classes. For instance, a credit card fraud detection system with an accuracy of 0.45 would be unreliable, as it is more likely to misclassify transactions than to correctly identify them.


In [None]:
baseline_cf_matrix = confusion_matrix(y_test, y_pred_baseline)
make_confusion_matrix(baseline_cf_matrix, group_names=labels)


-----
## Where to Go Next


### Unbalanced Classes

>  In a balanced dataset, the number of Positive and Negative labels is about equal. However, if one label is more common than the other label, then the dataset is imbalanced.
>
> Imbalanced datasets sometimes don't contain enough minority class examples to train a model properly.

* https://developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets
* https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
* https://datasciencehorizons.com/handling-imbalanced-datasets-in-scikit-learn-techniques-and-best-practices/
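
Resampling is one way to handle imbalance. Another option that scikit-learn supports is weighting each class inversely to its frequency, for example through the `class_weight='balanced'` argument that many classifiers accept. This is a swapped-in alternative, not the approach used in this notebook. A minimal sketch on made-up labels showing what those weights would be:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 8 "no rain" days (0) and 2 "rain" days (1)
y_demo = np.array([0] * 8 + [1] * 2)

# "balanced" uses n_samples / (n_classes * count_of_each_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], weights)))
```

The rarer class receives the larger weight, so mistakes on rain days would cost more during training.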

In [None]:
from sklearn.utils import resample
#
# Create oversampled training data set for minority class
# (resample only the training split so copies of test rows cannot leak into training)
#
X_oversampled, y_oversampled = resample(X_train[y_train == 1],  # Fewer rain days than 'no rain' days
                                        y_train[y_train == 1],
                                        replace=True,
                                        n_samples=X_train[y_train == 0].shape[0],
                                        random_state=1)

# Append the oversampled minority class to the imbalanced data and related labels
X_balanced = pd.concat([X_train[y_train == 0], X_oversampled])
y_balanced = pd.concat([y_train[y_train == 0], y_oversampled])

print(X_balanced.shape)
print(y_balanced.shape)
print(y_balanced.sum())

# plt.hist(Data['RainTom'])
sns.countplot(x ='RainTom', data = pd.DataFrame(y_balanced))
plt.show()

In [None]:
from sklearn.ensemble import RandomForestClassifier

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_balanced, y_balanced, test_size=0.2, random_state=1)
print (f'X_train2: {X_train2.shape} \nX_test2: {X_test2.shape} \ny_train2: {y_train2.shape} \ny_test2: {y_test2.shape}')

bal_rf_clf = RandomForestClassifier().fit(X_train2, y_train2)
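y_pred_bal_rf = bal_rf_clf.predict(X_test2)

# Note: oversampling duplicates rows, so copies of the same day can land in both
# X_train2 and X_test2; treat this accuracy as optimistic.
print(f"\nRF model accuracy = {accuracy_score(y_test2, y_pred_bal_rf)}")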

In [None]:
cf_matrix = confusion_matrix(y_test2, y_pred_bal_rf)
make_confusion_matrix(cf_matrix, group_names=labels)

In [None]:
# How does the new model trained on balanced data do on original test data?

y_pred_bal_rf = bal_rf_clf.predict(X_test)

#print(f"RF model accuracy = {accuracy_score(y_test, y_pred_bal_rf)}")
cf_matrix = confusion_matrix(y_test, y_pred_bal_rf)
make_confusion_matrix(cf_matrix, group_names=labels)


In [None]:
# Do balanced classes help with any classifier?

from sklearn.linear_model import LogisticRegression

# Set regularization rate
reg = 0.01

# train a logistic regression model on the training set
bal_lr_clf = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train2, y_train2)

y_pred_bal_lr = bal_lr_clf.predict(X_test)
print(f"LR model accuracy = {accuracy_score(y_test, y_pred_bal_lr)}")

<br>

### Normalize / Scale Data

> Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.
> 
> This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.

* https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
* http://archive.today/2024.09.25-195922/https://www.geeksforgeeks.org/data-normalization-with-python-scikit-learn/
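
As a small illustration of standardization, the sketch below fits a `StandardScaler` on training rows only and reuses those statistics for the test rows. The arrays are made up, loosely standing in for a temperature column and a precipitation column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (e.g. degrees F and inches)
X_train_demo = np.array([[90.0, 0.00], [70.0, 0.50], [80.0, 1.00]])
X_test_demo = np.array([[85.0, 0.25]])

scaler = StandardScaler().fit(X_train_demo)      # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)    # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_test_scaled)
```

Fitting the scaler on the training split alone keeps information about the test rows out of the preprocessing step.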


<br>

### Time Series Data

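We ignored any time series information when training our models: each day was treated as an independent sample, and the random train/test split mixes earlier and later days.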

* https://builtin.com/data-science/time-series-forecasting-python
* https://www.kaggle.com/code/kanncaa1/time-series-prediction-tutorial-with-eda
* https://machinelearningmastery.com/start-here/#timeseries
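
One simple way to respect time order during evaluation is scikit-learn's `TimeSeriesSplit`, which always trains on earlier rows and tests on later ones. A minimal sketch on made-up data, assuming the rows are already sorted by date:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten made-up daily observations, already sorted by date
X_demo = np.arange(10).reshape(-1, 1)

splits = list(TimeSeriesSplit(n_splits=3).split(X_demo))
for train_idx, test_idx in splits:
    # every training index comes before every test index
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```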


<br>

### Predicting Rainfall Amount Categories

* https://www.kaggle.com/code/satishgunjal/binary-multiclass-classification-using-sklearn#Multiclass-Classification
* http://archive.today/2023.10.29-124722/https://towardsdatascience.com/comprehensive-guide-to-multiclass-classification-with-sklearn-127cc500f362
* http://archive.today/2024.09.25-193956/https://www.geeksforgeeks.org/multiclass-classification-using-scikit-learn/


In [None]:
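bins = [-1.0, 0.00001, 0.39, 1.18, 100]
# Use a separate name so the confusion-matrix `labels` list defined earlier is not overwritten
rain_labels = ['NoRain', 'Light', 'Moderate', 'Heavy']
Data['RainCategory'] = pd.cut(Data['Precip'], bins, labels=rain_labels)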

print(Data.groupby('RainCategory', observed=True).size())

In [None]:
# Just pyplot
# plt.hist(Data.RainCategory)

# Using seaborn
sns.countplot(x ='RainCategory', data = Data)
plt.show()
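
The binning above depends on how `pd.cut` treats values that fall exactly on a bin edge. By default each interval is closed on the right, so a value equal to an edge goes into the lower category. A short sketch with made-up precipitation amounts (the `category_names` and `sample` names are hypothetical):

```python
import pandas as pd

bins = [-1.0, 0.00001, 0.39, 1.18, 100]
category_names = ['NoRain', 'Light', 'Moderate', 'Heavy']

# Made-up precipitation amounts in inches, including a value on a bin edge (0.39)
sample = pd.Series([0.0, 0.2, 0.39, 0.5, 2.0])
categories = pd.cut(sample, bins, labels=category_names)
print(categories.tolist())
```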