Notebook Title: Huntsman_ml.ipynb

Author: Kate Huntsman

Link to GitHub Repository: https://github.com/katehuntsman/datafun-07-ml

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Part 1 - Chart a Straight Line

## Plot Celsius vs Fahrenheit

In [None]:
c = lambda f: 5 / 9 * (f - 32)

temps = [(f, c(f)) for f in range(0, 101, 10)]

temps_df = pd.DataFrame(temps, columns=['Fahrenheit', 'Celsius'])

axes = temps_df.plot(x='Fahrenheit', y='Celsius', style='.-')

y_label = axes.set_ylabel('Celsius')

## Part 2 - Prediction

## Data Acquisition

In [10]:
# Path to your CSV file
csv_file_path = '/Users/katehuntsman/Downloads/IntroToPython-master/examples/ch10/snippets_py/ave_hi_nyc_jan_1895-2018.csv'


# Load the CSV file into a DataFrame
nyc_df = pd.read_csv(csv_file_path)

## Data Inspection

In [None]:
# Display head and tail
print(nyc_df.head())
print(nyc_df.tail())

## Data Cleaning

In [19]:
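# Clean the data
nyc_df.columns = ['Year', 'Avg_High_Temp', 'Anomaly']
# The raw date values may be stored as YYYYMM (e.g. 189501), which '%Y' alone
# cannot parse; keep only the first four digits before converting.
nyc_df['Year'] = pd.to_datetime(nyc_df['Year'].astype(str).str[:4], format='%Y', errors='coerce')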

## Descriptive Statistics

In [None]:
pd.set_option('display.precision', 2)
nyc_df.describe()

## Build the Model

In [None]:
from scipy.stats import linregress
years = nyc_df['Year'].dt.year.values
avg_temps = nyc_df['Avg_High_Temp'].values

slope, intercept, r_value, p_value, std_err = linregress(years, avg_temps)
print(f"Slope: {slope:.4f}, Intercept: {intercept:.4f}")


## Predict

In [None]:
year_to_predict = 2024
predicted_temp = slope * year_to_predict + intercept
print(f'Predicted Average High Temp for January {year_to_predict}: {predicted_temp:.2f} °F')

## Visualizations

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x=nyc_df['Year'].dt.year, y=nyc_df['Avg_High_Temp'], color='blue')
plt.plot(nyc_df['Year'].dt.year, slope * nyc_df['Year'].dt.year + intercept, color='red')
plt.title('Avg High Temp in NYC - January')
plt.xlabel('Year')
plt.ylabel('Avg High Temp (°F)')
plt.axhline(y=predicted_temp, color='green', linestyle='--', label='Predicted Temp 2024')
plt.legend()
plt.show()

## Part 3 - Prediction

## Build the Model

In [None]:
X = nyc_df['Year'].dt.year.values.reshape(-1, 1)
y = nyc_df['Avg_High_Temp']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Train the Model

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)

print(f"Coefficient: {model.coef_[0]}, Intercept: {model.intercept_}")

## Test the Model

In [None]:
# Test the model by predicting the values on the test set
y_pred = model.predict(X_test)

# Compare predicted vs actual values
comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison_df.head())

## Predict

In [None]:
future_prediction = model.predict(np.array([[2024]]))
print(f"Predicted Average High Temp in Jan 2024: {future_prediction[0]:.2f} °F")

## Visualizations

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x=X_test.flatten(), y=y_test, color='blue', label='Test Data')
plt.plot(X.flatten(), model.predict(X), color='red', label='Best Fit Line')
plt.title('Model Prediction')
plt.xlabel('Year')
plt.ylabel('Avg High Temp (°F)')
plt.axhline(y=future_prediction[0], color='green', linestyle='--', label='Predicted Temp 2024')
plt.legend()
plt.show()

## Part 4 - Insights and Comparisons

In this project, I used two methods to predict the average high temperature in January for New York City: the **SciPy `linregress` method** and the **scikit-learn Linear Regression model**. Each approach has strengths and weaknesses, which are discussed below.

### Method 1: SciPy `linregress`
- **Simplicity**: The `linregress` function from SciPy fits a regression line in a single call. It returns the slope, intercept, correlation coefficient, p-value, and standard error directly.
- **Use Case**: This method suits quick analyses with one predictor, such as year versus temperature here. It produces results without any setup beyond passing two arrays.
- **Limitations**: `linregress` handles only simple regression with a single independent variable, so it is less flexible for more complex modeling tasks. It has no built-in support for feature scaling, cross-validation, or train/test evaluation.
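
The points above can be sketched in a few lines. This is a minimal illustration on synthetic, made-up data (not the NYC dataset); the names `demo_years`, `demo_temps`, and `predict_temp` are hypothetical and exist only for this example.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data (an assumption for illustration): a noisy upward trend.
rng = np.random.default_rng(0)
demo_years = np.arange(1900, 2000)
demo_temps = 30.0 + 0.02 * (demo_years - 1900) + rng.normal(0.0, 1.0, demo_years.size)

# One call fits the line; the result exposes named fields.
result = linregress(demo_years, demo_temps)
print(f"slope={result.slope:.4f}, intercept={result.intercept:.2f}")
print(f"r={result.rvalue:.3f}, p={result.pvalue:.3g}, stderr={result.stderr:.4f}")


def predict_temp(year):
    """Evaluate the fitted line y = slope * x + intercept (hypothetical helper)."""
    return result.slope * year + result.intercept


demo_prediction = predict_temp(2024)
print(f"prediction for 2024: {demo_prediction:.2f}")
```

Anything beyond this, such as holding out test data or scoring the fit on unseen years, has to be written by hand.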

### Method 2: Scikit-learn Linear Regression
- **Flexibility**: scikit-learn supports a fuller modeling workflow, such as splitting data into training and testing sets, which makes it possible to evaluate the model on data it has not seen.
- **Performance Evaluation**: This method facilitates the use of various performance metrics (e.g., R-squared, Mean Absolute Error) and allows for more intricate preprocessing steps like feature scaling.
- **Complexity**: However, it requires a bit more setup and understanding of machine learning practices. For beginners, it might seem daunting initially.
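
As a rough sketch of the evaluation step described above, the following scores a fitted model on held-out data with R-squared and Mean Absolute Error. The data is synthetic and made up for illustration, and the names `demo_X`, `demo_y`, and `demo_model` are hypothetical, not part of the notebook above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): a noisy upward trend.
rng = np.random.default_rng(1)
demo_X = np.arange(1900, 2000).reshape(-1, 1)  # scikit-learn expects 2D features
demo_y = 30.0 + 0.02 * (demo_X.ravel() - 1900) + rng.normal(0.0, 1.0, demo_X.shape[0])

# Hold out 20% of the rows so the metrics reflect unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

demo_model = LinearRegression().fit(X_tr, y_tr)
y_hat = demo_model.predict(X_te)

# Score the predictions against the held-out targets.
mae = mean_absolute_error(y_te, y_hat)
r2 = r2_score(y_te, y_hat)
print(f"MAE: {mae:.3f}, R-squared: {r2:.3f}")
```

The same pattern should apply to the NYC data by swapping in `X_test`, `y_test`, and `y_pred` from Part 3.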

### Conclusion
Both methods yielded similar predictions for the average high temperature in January 2024. The choice between them largely depends on the specific requirements of the analysis. For quick, simple predictions, `linregress` is adequate. For more detailed modeling and evaluation, scikit-learn is the better option.
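
The similarity is expected, since both methods fit an ordinary least squares line. A small sketch on synthetic, made-up data (the names `demo_years`, `demo_temps`, `scipy_fit`, and `sk_fit` are hypothetical) suggests that when both are fitted on the same full dataset, the slope and intercept agree to within floating-point error. Any gap between the two predictions in this notebook would therefore come from scikit-learn being trained on only 80% of the rows.

```python
import numpy as np
from scipy.stats import linregress
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): a noisy upward trend.
rng = np.random.default_rng(2)
demo_years = np.arange(1900, 2000)
demo_temps = 30.0 + 0.02 * (demo_years - 1900) + rng.normal(0.0, 1.0, demo_years.size)

# Fit both models on the same, full dataset (no train/test split).
scipy_fit = linregress(demo_years, demo_temps)
sk_fit = LinearRegression().fit(demo_years.reshape(-1, 1), demo_temps)

# Compare the fitted parameters within floating-point tolerance.
same_slope = np.isclose(scipy_fit.slope, sk_fit.coef_[0])
same_intercept = np.isclose(scipy_fit.intercept, sk_fit.intercept_)
print(f"slopes match: {same_slope}, intercepts match: {same_intercept}")
```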

In future analyses, I would weigh simplicity against flexibility based on the project's objectives and the complexity of the data involved.

## Part 5 - Bonus