'''
This first model is built on current-season data: it assumes you already know the season's yards, TDs, and all the
other necessary stats. Because of that, the model comes out great. BUT it does not accomplish my overall goal of
predicting how well a running back will do, because I already know the outcome.
I used it only to see which other fields are useful for the model.
'''

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Specify the path to your CSV file
csv_file_path = r"C:\Users\sulli\OneDrive\Documents\season_on_season_data.csv"

# Load the dataset into a DataFrame
final_df = pd.read_csv(csv_file_path)

# Work under a second name; the drop() below returns a new DataFrame, so final_df keeps every column
data = final_df

# Drop specified fields
fields_to_drop = ['position_first',
                  'fantasy_points_ppr_sum', 'season_first','player_name',
                  'fantasy_points_over_10_sum','index'
                 ]

data = data.drop(columns=fields_to_drop)

# Perform one-hot encoding for string variables
encoded_data = pd.get_dummies(data)

# Select features and target variable
target = data['fantasy_points_sum']
features = encoded_data.drop(columns=['fantasy_points_sum'])  # Exclude the target variable

# Correlation of each feature with the target (a Series, not a full correlation matrix)
target_correlations = features.corrwith(target)

# Sort the correlations in descending order
sorted_correlations = target_correlations.sort_values(ascending=False)

# Print the top correlated features
print("Top correlated features:")
print(sorted_correlations.head(50))  # Adjust the number to display more or fewer features

'''
What this tells me is that the sums carry more weight than the means and medians, because means and medians don't
factor in injuries. A player who gets injured, or who comes back from IR partway through the year, plays fewer games
and ends up with fewer points. Going to use Random Forest next to see which features are important.
That way, when I do make the model more predictive, I don't have to focus on injuries as much.
'''
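
'''
Illustration (not part of the analysis above): a minimal sketch of why season sums and per-game means can disagree
when a player misses games. The toy data, the player labels, and the helper summarize_games are hypothetical and
are not taken from season_on_season_data.csv.
'''
import pandas as pd


def summarize_games(games):
    """Per-player sum, mean, median and game count of a per-game 'points' column."""
    return games.groupby("player")["points"].agg(["sum", "mean", "median", "count"])


# Two made-up players who score the same per game; one misses half the season.
toy_games = pd.DataFrame({
    "player": ["healthy"] * 16 + ["injured"] * 8,
    "points": [12.0] * 24,
})
toy_summary = summarize_games(toy_games)
print(toy_summary)  # same mean and median in both rows, but different sums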

# Initialize Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model to the data
rf.fit(features, target)

# Get feature importances
feature_importances = rf.feature_importances_

# Create a DataFrame to display feature importances
feature_importance_df = pd.DataFrame({'Feature': features.columns, 'Importance': feature_importances})

# Sort the DataFrame by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the top features
print("Top features according to RandomForest:")
print(feature_importance_df.head(30))  # Adjust the number to display more or fewer features

'''
Random Forest shows me that yards are king.

Next, a Gradient Boosting model (GBM) to see how its feature importances compare.
'''
# Initialize Gradient Boosting Regressor
gbm = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Fit the model to the data
gbm.fit(features, target)

# Get feature importances
feature_importances = gbm.feature_importances_

# Create a DataFrame to display feature importances
feature_importance_df = pd.DataFrame({'Feature': features.columns, 'Importance': feature_importances})

# Sort the DataFrame by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Print the top features
print("Top features according to GradientBoosting:")
print(feature_importance_df.head(20))  # Adjust the number to display more or fewer features


'''
Interesting that the importance of time of possession differs a lot between the two models. This also shows that
yards are king, while TDs definitely matter as well.
Now let's try my linear regression model.
'''



# Select the features
selected_features = ['total_yards_sum', 'rushing_tds_sum', 'receiving_tds_sum',
                     'fumbles_lost_mean', 'Defensive_Line_Past_3_Years_Sum_sum',
                     'time_of_possession_mean', 'amount of points scored by the team_mean',
                     'targets_sum', 'team_total_yards_mean', 'Offense_Past_3_Years_Sum_sum',
                     'rushing_carries_sum']

# Select features and target variable
X = data[selected_features]
y = data['fantasy_points_sum']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Linear Regression model
linear_reg = LinearRegression()

# Fit the model to the training data
linear_reg.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = linear_reg.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Calculate R-squared
r2 = r2_score(y_test, y_pred)
print("R-squared:", r2)

# Calculate Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)
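
'''
Illustration: one possible reason the scores above look so good. If fantasy points are computed directly from yards
and touchdowns, a linear model that is given those same stats can reproduce the target almost exactly. The scoring
weights below (0.1 per yard, 6 per touchdown) are an assumption for illustration, not a rule confirmed for this
dataset, and every number is synthetic. The helper leakage_r2 is hypothetical.
'''
import numpy as np


def leakage_r2(n_players=200, seed=0):
    """R-squared of an ordinary least squares fit when the target is an exact function of the features."""
    rng = np.random.default_rng(seed)
    yards = rng.integers(0, 2000, n_players).astype(float)
    tds = rng.integers(0, 20, n_players).astype(float)
    points = 0.1 * yards + 6.0 * tds  # ASSUMED scoring rule, for illustration only

    # Least squares fit with an intercept column, then R-squared computed by hand
    design = np.column_stack([np.ones(n_players), yards, tds])
    coef, *_ = np.linalg.lstsq(design, points, rcond=None)
    residuals = points - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((points - points.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


toy_r2 = leakage_r2()
print("R-squared on synthetic leaked data:", toy_r2)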

'''
For the current year, this gives me an idea that my model works. But I want to look at the past, so I am going to
tweak my data a little to incorporate that. I am going to set this up more as a prediction, where I don't use
stats from the current season. The only current-season information I can use is what is known at the beginning of
the year, such as weight, height, and Pro Bowl information. I can't use the yards for the current year, but I can use
yards from previous years. I also can't use in-game data; I have to use data from the past. That is what I am going
to tweak in the next part: setting this up as a prediction rather than a description of the current season.
'''
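
'''
A minimal sketch of one way the "previous years only" setup could look: shift each player's season stats down by one
row so that a season's features come from the season before it. The helper add_prior_season_features and the toy rows
are hypothetical. The column names player_name and season_first are borrowed from the fields dropped earlier, and it
is an assumption that the real file has one row per player per season.
'''
import pandas as pd


def add_prior_season_features(df, stat_cols, player_col="player_name", season_col="season_first"):
    """Return a sorted copy of df with prev_<stat> columns holding each player's previous-row values."""
    out = df.sort_values([player_col, season_col]).copy()
    # shift(1) within each player moves every stat down one row, so a season only sees the
    # season before it. This is the previous season on record, which is not necessarily the
    # previous calendar year if a player skipped a season.
    prior = out.groupby(player_col)[stat_cols].shift(1)
    prior.columns = ["prev_" + col for col in stat_cols]
    return pd.concat([out, prior], axis=1)


# Made-up seasons for two made-up players
toy_seasons = pd.DataFrame({
    "player_name": ["A", "A", "A", "B", "B"],
    "season_first": [2021, 2022, 2023, 2022, 2023],
    "total_yards_sum": [900, 1200, 1100, 400, 700],
    "fantasy_points_sum": [150.0, 210.0, 190.0, 60.0, 110.0],
})
toy_lagged = add_prior_season_features(toy_seasons, ["total_yards_sum", "fantasy_points_sum"])

# A player's first season on record has no prior stats, so it cannot be used as a training row
toy_model_rows = toy_lagged.dropna(subset=["prev_total_yards_sum"])
print(toy_model_rows)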