
# Predicting Sea Level
### Linear Regression Mini-Project

https://www.freecodecamp.org/learn/data-analysis-with-python/data-analysis-with-python-projects/sea-level-predictor

1. Use Pandas to import the data from epa-sea-level.csv.
2. Use matplotlib to create a scatter plot using the Year column as the x-axis and the CSIRO Adjusted Sea Level column as the y-axis.
3. Use the linregress function from scipy.stats to get the slope and y-intercept of the line of best fit. Plot the line of best fit over the top of the scatter plot. Make the line go through the year 2050 to predict the sea level rise in 2050.
4. Plot a new line of best fit just using the data from year 2000 through the most recent year in the dataset. Make the line also go through the year 2050 to predict the sea level rise in 2050 if the rate of rise continues as it has since the year 2000.
5. The x label should be Year, the y label should be Sea Level (inches), and the title should be Rise in Sea Level.

In [None]:
#!pip install scipy

In [21]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import numpy as np
#from scipy.stats import linregress

In [22]:
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/data/SeaLevelRise/epa-sea-level2.csv')

In [24]:
df = pd.DataFrame(data)
print(df.head())

         Year  CSIRO Adjusted Sea Level  Lower Error Bound  Upper Error Bound  \
0  1880-03-15                  0.000000          -0.952756           0.952756   
1  1881-03-15                  0.220472          -0.732283           1.173228   
2  1882-03-15                 -0.440945          -1.346457           0.464567   
3  1883-03-15                 -0.232283          -1.129921           0.665354   
4  1884-03-15                  0.590551          -0.283465           1.464567   

   NOAA Adjusted Sea Level  
0                      NaN  
1                      NaN  
2                      NaN  
3                      NaN  
4                      NaN  


In [25]:
# Trim the Year column to just the four-digit year
trim_year = df['Year'].str[:4]
print(trim_year.head())

0    1880
1    1881
2    1882
3    1883
4    1884
Name: Year, dtype: object
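
Slicing the first four characters works because every value in the Year column starts with a four-digit year. As an alternative sketch, assuming the column holds ISO-style dates like the ones printed above, the strings can be parsed as dates and the year pulled out as an integer in one step. The `sample` Series is a hypothetical stand-in for `df['Year']`.

```python
import pandas as pd

# Hypothetical stand-in for df['Year'], using the first values shown above.
sample = pd.Series(["1880-03-15", "1881-03-15", "1882-03-15"], name="Year")

# Parse the strings as dates, then keep only the calendar year as an integer.
years = pd.to_datetime(sample, format="%Y-%m-%d").dt.year

print(years.tolist())
```

This avoids the separate `astype` conversion later, at the cost of failing loudly if any value is not a valid date.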


In [26]:
# Create x and y axis
#x = pd.Series(trim_year.values, dtype='int32')
x = pd.Series(trim_year.astype('int32'))
y = pd.Series(df['CSIRO Adjusted Sea Level'].values)


# create additional numpy array for additional years
#x_more_years = np.arange(2014, 2051, 1)
print(x.head(), x.tail())

0    1880
1    1881
2    1882
3    1883
4    1884
Name: Year, dtype: int32 130    2010
131    2011
132    2012
133    2013
134    2014
Name: Year, dtype: int32


In [33]:
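# Note: df[trim_year] treats each year string as a column label, so it raises a KeyError.
# trim_year is already a Series; trim_year.head() displays it.
df[trim_year].head()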


KeyError: ignored

In [28]:
x[0].dtype

dtype('int32')

In [30]:
df.describe()


       CSIRO Adjusted Sea Level  Lower Error Bound  Upper Error Bound  NOAA Adjusted Sea Level
count                     134.0              134.0              134.0                     22.0
mean                   3.650341           3.204666           4.096016                 7.422835
std                    2.485692           2.663781           2.312581                 0.729114
min                   -0.440945          -1.346457           0.464567                 6.297493
25%                    1.632874            1.07874           2.240157                 6.852969
50%                    3.312992           2.915354            3.71063                 7.498143
75%                    5.587598           5.329724           5.845472                 8.011607
max                    9.326772           8.992126           9.661417                   8.6637


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135 entries, 0 to 134
Data columns (total 5 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Year                      135 non-null    object 
 1   CSIRO Adjusted Sea Level  134 non-null    float64
 2   Lower Error Bound         134 non-null    float64
 3   Upper Error Bound         134 non-null    float64
 4   NOAA Adjusted Sea Level   22 non-null     float64
dtypes: float64(4), object(1)
memory usage: 5.4+ KB
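
The `info()` output lists 135 rows but only 134 non-null values in CSIRO Adjusted Sea Level, so one row has no measurement. By default a NaN in the input propagates into the regression results, so such rows should be dropped before fitting. A minimal sketch on a hypothetical three-row frame (`toy` and its values are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing measurement, mimicking the gap above.
toy = pd.DataFrame({
    "Year": [2012, 2013, 2014],
    "CSIRO Adjusted Sea Level": [9.0, 9.3, np.nan],
})

# Keep only the rows where the column used for fitting is present.
clean = toy.dropna(subset=["CSIRO Adjusted Sea Level"])

print(len(toy), len(clean))
```

Passing `subset` limits the check to the fitted column, so rows are not discarded just because the sparsely populated NOAA column is empty.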


In [None]:
# Create first line of best fit.
# A NaN in the input makes the linregress results NaN, and df.info() shows one
# missing value in the CSIRO column, so fit only on rows where y is present.
mask = y.notna()
res = scipy.stats.linregress(x[mask], y[mask])  # result exposes slope, intercept, rvalue, pvalue, stderr

# Line of best fit: y = m*x + b
#   m - slope (change in sea level, in inches, per year)
#   b - intercept

# rvalue: the correlation coefficient, which ranges from -1 to 1 and indicates the
# strength and direction of the linear relationship between x and y.

# Squaring rvalue gives the coefficient of determination (R-squared): the proportion
# of the variance in y that is explained by the linear relationship with x. It ranges
# from 0 (the line explains none of the variance) to 1 (the line explains all of it),
# so a higher value indicates a better linear fit.

# pvalue: the p-value of a hypothesis test whose null hypothesis is that the slope is zero.
# It is the probability of obtaining data at least as extreme as the data observed
# if the null hypothesis were true. A p-value below 0.05 is conventionally treated as
# statistically significant; it is not the probability that the result arose by chance.

print(f"Slope: {res.slope:.6f}")
print(f"Y-intercept: {res.intercept:.6f}")
print(f"Std Error: {res.stderr:.6f}")
print(f"R-value (Correlation coefficient): {res.rvalue:.6f}")
print(f"R-squared (Coefficient of determination): {res.rvalue**2:.6f}")
print(f"P-value: {res.pvalue:.6f}")

In [None]:
res
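
Step 3 of the task asks for a prediction at the year 2050. Once the slope and intercept are known, that is y = mx + b evaluated at x = 2050. The sketch below uses made-up points that lie exactly on a line (not the EPA data) so the result can be checked by hand; `years`, `levels`, and `predicted_2050` are hypothetical names.

```python
import numpy as np
from scipy.stats import linregress

# Made-up data on the exact line y = 0.1 * (x - 2000); not the EPA measurements.
years = np.array([2000, 2005, 2010, 2015])
levels = 0.1 * (years - 2000)

fit = linregress(years, levels)

# Evaluate y = m*x + b at the target year.
predicted_2050 = fit.intercept + fit.slope * 2050

print(f"Predicted level in 2050: {predicted_2050:.2f}")
```

With the notebook's own fit, the same expression would be `res.intercept + res.slope * 2050`.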

In [None]:
# Create scatter plot
plt.scatter(x,y, color = 'maroon', alpha=0.3, label="Sea Level Rise Data")
plt.plot(x, res.intercept + res.slope*x, 'r', label='Best Fit Line')

# Add labels and title
plt.xlabel('Year')
plt.ylabel('Sea Level (inches)')
plt.title('Sea Level Rise: 1880 - 2014')

plt.legend()

# Show the plot.
plt.show()
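
The plot above stops at the last year in the data, while steps 3 and 4 of the task ask for both lines of best fit to run through 2050, with the second fit using only the years from 2000 onward. One way this could look, sketched on hypothetical stand-in data (a noisy upward trend generated with NumPy, not the EPA measurements); all variable names here are assumptions for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import linregress

# Hypothetical stand-in data (a noisy upward trend), not the EPA measurements.
rng = np.random.default_rng(0)
years = np.arange(1880, 2014)
levels = 0.06 * (years - 1880) + rng.normal(0, 0.3, years.size)

fig, ax = plt.subplots()
ax.scatter(years, levels, alpha=0.3, label="Data")

# Fit on all years and draw the line out to 2050.
fit_all = linregress(years, levels)
x_all = np.arange(years.min(), 2051)
ax.plot(x_all, fit_all.intercept + fit_all.slope * x_all, "r", label="Fit: all years")

# Fit again using only the years from 2000 onward, also drawn out to 2050.
recent = years >= 2000
fit_recent = linregress(years[recent], levels[recent])
x_recent = np.arange(2000, 2051)
ax.plot(x_recent, fit_recent.intercept + fit_recent.slope * x_recent, "g", label="Fit: 2000 onward")

ax.set_xlabel("Year")
ax.set_ylabel("Sea Level (inches)")
ax.set_title("Rise in Sea Level")
ax.legend()
```

The key point is that each line is drawn over its own x range ending at 2050 (`np.arange(..., 2051)`), rather than over the observed years only.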

In [None]:
#sns.set_style("darkgrid", {"grid.color": ".8", "grid.linestyle": "-"})
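sns.set_theme(context='notebook', style='darkgrid', palette='deep', font='sans-serif', font_scale=1, color_codes=True, rc=None)
# df["Year"] holds date strings, which regplot cannot fit, so pass the integer years in x instead.
# order=3 fits a cubic polynomial rather than a straight line.
sns.regplot(x=x, y=df["CSIRO Adjusted Sea Level"], ci=100, marker="o", color=".3", line_kws=dict(color="r"), order=3)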

In [None]:
# Save plot and return data for testing (DO NOT MODIFY)
plt.savefig('sea_level_plot.png')