##### Author: Byron Pineda

####  Project 1 
This part of the team project examined whether the number of Covid vaccinations administered was related to the reduction in unemployment in the US. The study used the number of vaccinations administered rather than the number delivered: more doses were delivered than administered, and a delivered dose that was never administered had no impact. This data was sourced from the Centers for Disease Control and Prevention (CDC).

A common theme from our government and health officials throughout the Covid-19 pandemic has been that "getting shots in the arms" would be key to getting back to normalcy: reducing transmission rates, reducing death rates, and getting our economy back on track. Vaccination alone would not end the pandemic, but it would be one of the measures that reduce the impact the pandemic has had on our lives and economy.

Other measures clearly played a more immediate role in improving unemployment rates since April 2020, such as the three rounds of economic stimulus checks, employees working remotely instead of being laid off, businesses adapting their business models, and other factors beyond the scope of this study. US unemployment peaked at over 14% in April 2020, and by August it had dropped drastically to around 8%. Since data for vaccinations administered were not available until December 2020, other factors played the key role in that early reduction of unemployment.

When we created the scatter plot and computed the Pearson correlation coefficient between "Unemployment Rates" and "Vaccines Administered", it produced a negative (inverse) correlation of -0.86, which is a strong indicator of a relationship between those two variables. A negative correlation describes a relationship between two variables that move in opposing directions. Put another way, as one variable (vaccines administered) increased, the value of the other variable (unemployment rates) decreased. In this case, as more Covid vaccinations were administered there was a decrease in the unemployment rate.

Remember correlation does not imply causality even if there is a strong correlation between the two variables! The correlation coefficient simply represents the degree of association between two sets of measurements. 

Let us look at linear regression and R-squared now. In general, higher R-squared values represent smaller differences between the observed and fitted values. In this analysis the R-squared was 0.75, indicating a fairly strong relationship. The p-value obtained here is 0.0120998, which is well below the standard threshold of 0.05. A large p-value would suggest that changes in the predictor (vaccinations administered) are not associated with changes in the response (unemployment rate). In this case, however, the small p-value indicates that changes in the predictor (vaccinations administered) are associated with changes in the response variable (unemployment rates). At least mathematically, there looks to be a statistically significant relationship between the number of vaccinations administered and unemployment rates.


##### Names of the CSV files used for this analysis.
vac_admin = 'Data_Vaccs_Unemp/Covid Administered Vaccinations in US and States remediated file.csv'

unempl = 'Data_Vaccs_Unemp/unemployment rate in US.csv'

The vaccination file from the CDC required a great amount of cleanup. Excel was used to study the file and to build PivotTables and charts in order to determine how the government totals were calculated. The vaccination records follow the CDC's Morbidity and Mortality Weekly Report (MMWR) weeks, which start on Sunday and end on Saturday. Those records had to be reconciled to a monthly basis to match our other reporting data from the BLS, which uses monthly totals. Part of the problem with the weekly numbers is that a given week may contain part of the prior month. For example, when you look at April you see the last week of March's data as part of that weekly accumulation. The file was remediated from almost 14,000 records to about 250 records. Below is a screen grab of some of the cleanup analysis in Excel.

![image.png](attachment:image.png)
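The reconciliation from dated records to one record per month could be sketched in pandas roughly as follows. This is only a minimal sketch and not the Excel procedure that was actually used. It assumes the counts are cumulative, so the last record in each month carries the month-end total. The two month-end values come from the remediated file; the mid-month rows are placeholders invented for illustration, and the `month` column name is hypothetical.

```python
import pandas as pd

# Cumulative counts for one location. The month-end values match the
# remediated file; the mid-month rows are placeholders invented for this sketch.
records = pd.DataFrame(
    {
        "Date": ["1/24/2021", "1/31/2021", "2/21/2021", "2/28/2021"],
        "Location": ["US", "US", "US", "US"],
        "Covid_Vacs_Administered": [20_000_000, 31_123_299, 60_000_000, 75_236_003],
    }
)
records["Date"] = pd.to_datetime(records["Date"], format="%m/%d/%Y")

# Hypothetical helper column: the calendar month each record falls in.
records["month"] = records["Date"].dt.to_period("M")

# Because the counts are assumed cumulative, the latest record in each month
# is the month-end total, so keep only that row per location and month.
monthly = (
    records.sort_values("Date")
    .groupby(["Location", "month"], as_index=False)
    .last()
)
print(monthly[["Location", "month", "Covid_Vacs_Administered"]])
```

Assigning each record to a calendar month by its own date sidesteps the problem of an MMWR week straddling two months.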

The first four columns were programmatically added to make working with dates simpler. The Covid Administered (in thousands) and (in millions) columns were derived from the original Covid administered values.
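One way the helper columns could be derived is sketched below. The column names and the two sample rows are taken from the remediated file, but the code itself is an assumption about how such columns might be built, not the script that produced the file.

```python
import pandas as pd

# Two sample rows taken from the remediated vaccination file.
df = pd.DataFrame(
    {
        "Date": ["12/31/2020", "1/31/2021"],
        "Location": ["US", "US"],
        "Covid_Vacs_Administered": [3_738_130, 31_123_299],
    }
)
dates = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Four date-label columns in the formats seen in the file.
df["mm-yyyy"] = dates.dt.strftime("%m-%Y")
df["mm/yyyy"] = dates.dt.strftime("%m/%Y")
df["yyyy-mm"] = dates.dt.strftime("%Y/%m")  # the file stores this as 2020/12
df["mmm_yyyy"] = dates.dt.strftime("%b-%Y")

# Scaled copies of the raw count, rounded to two decimals.
df["Covid_Vacs_Administered (In thousands)"] = (
    df["Covid_Vacs_Administered"] / 1_000
).round(2)
df["Covid_Vacs_Administered (in millions)"] = (
    df["Covid_Vacs_Administered"] / 1_000_000
).round(2)
print(df)
```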

vacc_admin_us.head()

![image.png](attachment:image.png)

The unemployment data file for the US had to be manually transformed in Excel and saved as a .csv file.
unempl_us.head()

![image.png](attachment:image.png)

# Create the unemployment rates in US for viewing and analysis.

![image.png](attachment:image.png)

# The vaccination file that was remediated for analysis described earlier.
vacc_admin_us.head()

# Get the "US" location only.
us_location = vacc_admin_us.loc[vacc_admin_us["Location"] == "US"]
us_location.head(10)

| Unnamed: 0 | mm-yyyy | mm/yyyy | yyyy-mm | mmm_yyyy | Date | Location | Covid_Vacs_Administered | Covid_Vacs_Administered (In thousands) | Covid_Vacs_Administered (in millions) | MMWR_week |
|---|---|---|---|---|---|---|---|---|---|---|
| 440 | 12-2020 | 12/2020 | 2020/12 | Dec-2020 | 12/31/2020 | US | 3738130 | 3738.13 | 3.74 | 53 |
| 441 | 01-2021 | 01/2021 | 2021/01 | Jan-2021 | 1/31/2021 | US | 31123299 | 31123.3 | 31.12 | 5 |
| 442 | 02-2021 | 02/2021 | 2021/02 | Feb-2021 | 2/28/2021 | US | 75236003 | 75236.0 | 75.24 | 9 |
| 443 | 03-2021 | 03/2021 | 2021/03 | Mar-2021 | 3/31/2021 | US | 150273292 | 150273.29 | 150.27 | 13 |
| 444 | 04-2021 | 04/2021 | 2021/04 | Apr-2021 | 4/30/2021 | US | 240159677 | 240159.68 | 240.16 | 17 |
| 445 | 05-2021 | 05/2021 | 2021/05 | May-2021 | 5/31/2021 | US | 295891325 | 295891.33 | 295.89 | 22 |
| 446 | 06-2021 | 06/2021 | 2021/06 | Jun-2021 | 6/30/2021 | US | 326521526 | 326521.53 | 326.52 | 26 |
| 447 | 07-2021 | 07/2021 | 2021/07 | Jul-2021 | 7/12/2021 | US | 334600770 | 334600.77 | 334.6 | 28 |


Set two x-axes and two y-axes of different sizes on one plot.
Please find specifications for the .subplots parameters at:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot.html

Unemployment rate plot
ax.plot(x_axis2,y_axis2, color="crimson")

Plot the vaccinations administered in tens of millions, as seen below.


![image.png](attachment:image.png)
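A simplified version of a two-scale plot is sketched below. It is not the layout used for the figure above: instead of two separate x-axes it shares one month axis and uses matplotlib's `twinx()` to give each series its own y-axis. The vaccination values are the month-end counts from the remediated file expressed in tens of millions; the unemployment values are illustrative stand-ins and are not read from the unemployment file.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

months = ["Dec-2020", "Jan-2021", "Feb-2021", "Mar-2021",
          "Apr-2021", "May-2021", "Jun-2021"]
# Month-end counts from the remediated file, in tens of millions.
vacc_tens_of_millions = [0.37, 3.11, 7.52, 15.03, 24.02, 29.59, 32.65]
# Stand-in values for illustration only; replace with the real monthly rates.
unemployment_rate = [6.7, 6.3, 6.2, 6.0, 6.1, 5.8, 5.9]

fig, ax_unemp = plt.subplots(figsize=(9, 4))
ax_unemp.plot(months, unemployment_rate, color="crimson")
ax_unemp.set_ylabel("Unemployment rate (%)", color="crimson")

# Second y-axis on the right that shares the same x-axis.
ax_vacc = ax_unemp.twinx()
ax_vacc.plot(months, vacc_tens_of_millions, color="steelblue")
ax_vacc.set_ylabel("Vaccinations administered (tens of millions)",
                   color="steelblue")

ax_unemp.set_title("US unemployment rate vs. vaccinations administered")
fig.tight_layout()
```

Keeping each series on its own y-axis stops the much larger vaccination numbers from flattening the unemployment line.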

#### Scatter Plot and correlation
Let's get the x-axis and y-axis values for the scatter plot so we can compute a correlation between
the number of vaccinations administered (in tens of millions) and unemployment in the US.

To do a scatter plot, the x-axis and y-axis must have the same number of elements.
Vaccinations administered data was only available from December 2020 through July 2021, while unemployment
numbers were available from January 2020 through June 2021, so we need to "throw out" some members of our
data set. That means keeping data from December 2020 through June 2021 and discarding the other data elements.
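Trimming both series to their common months can be sketched with an inner merge, as below. The vaccination values are from the remediated file. The unemployment frame is an assumption: its column name `unemployment_rate` is hypothetical and its values are illustrative stand-ins, covering only part of the real January 2020 through June 2021 range.

```python
import pandas as pd

vacc = pd.DataFrame(
    {
        "mmm_yyyy": ["Dec-2020", "Jan-2021", "Feb-2021", "Mar-2021",
                     "Apr-2021", "May-2021", "Jun-2021", "Jul-2021"],
        "Covid_Vacs_Administered": [3738130, 31123299, 75236003, 150273292,
                                    240159677, 295891325, 326521526, 334600770],
    }
)

# Stand-in unemployment data (hypothetical column name and values).
unemp = pd.DataFrame(
    {
        "mmm_yyyy": ["Nov-2020", "Dec-2020", "Jan-2021", "Feb-2021",
                     "Mar-2021", "Apr-2021", "May-2021", "Jun-2021"],
        "unemployment_rate": [6.7, 6.7, 6.3, 6.2, 6.0, 6.1, 5.8, 5.9],
    }
)

# An inner merge keeps only the months present in both data sets, which
# drops Nov-2020 (no vaccination data) and Jul-2021 (no unemployment data).
paired = vacc.merge(unemp, on="mmm_yyyy", how="inner")

x_axis = (paired["Covid_Vacs_Administered"] / 10_000_000).tolist()  # tens of millions
y_axis = paired["unemployment_rate"].tolist()
print(paired)
```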

##### The x-axis and y-axis data has been collected for the scatter plot. However, before doing that analysis let us make a linear plot using that data to have a macro-level view of unemployment rate vs. vaccinations administered for December 2020 - June 2021.

##### Create the scatter plot and compute the Pearson correlation coefficient between "Unemployment Rates" and "Vaccines Administered"
A negative or inverse correlation describes a relationship between two variables that move in opposing directions. Put another way, as one variable (vaccines administered) increases, the value of the other variable (unemployment rates) decreases. In this case, as more Covid vaccinations were administered there was a decrease in the unemployment rate.
 
The correlation coefficient can range from -1 to +1. A correlation coefficient with an absolute value of 0.75 or higher is generally considered a strong relationship. The correlation coefficient computed for this comparison was -0.86, which is a very strong indicator that there is a relationship between these two variables.

Remember correlation does not imply causality even if there is a strong correlation between the two variables! The correlation coefficient simply represents the degree of association between two sets of measurements. 
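For readers who want to see what the coefficient measures, the Pearson formula can be written out by hand as in the sketch below. The function name `pearson_r` is hypothetical, the x values are the month-end vaccination counts from the remediated file in tens of millions, and the y values are illustrative stand-ins for the unemployment rates, so the printed number should not be read as the project's result. A library routine such as `scipy.stats.pearsonr` computes the same quantity.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same number of elements")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(spread_x * spread_y)


# Vaccinations administered, Dec 2020 to Jun 2021, in tens of millions.
x_axis = [0.373813, 3.1123299, 7.5236003, 15.0273292,
          24.0159677, 29.5891325, 32.6521526]
# Stand-in unemployment rates for the same months (illustration only).
y_axis = [6.7, 6.3, 6.2, 6.0, 6.1, 5.8, 5.9]

r = pearson_r(x_axis, y_axis)
print(f"Pearson r = {r:.2f}")
```

A negative result means the two series move in opposite directions; the closer it is to -1, the more closely the points follow a downward-sloping line.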

![image.png](attachment:image.png)


##### Calculate and show the linear regression equation and line to plot.
The regression formula is composed of a response variable (y), in this case the unemployment rate, and a predictor variable (x), the vaccinations administered. In this case the formula predicts the unemployment rate when the vaccinations administered values are known.

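Simply put, R-squared is a measure of how close the data are to the fitted regression line. In a simple linear regression, R-squared is the square of Pearson's correlation coefficient. In the above graph, Pearson's correlation coefficient was calculated as -0.86, so the expected R-squared value would be approximately 0.74 (0.86 × 0.86 = 0.7396). The small gap to the fitted value of 0.75 is consistent with the coefficient having been rounded to two decimals.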

In general, higher R-squared values represent smaller differences between the observed and fitted values. In this analysis the R-squared was 0.75, indicating a fairly strong relationship. The p-value obtained here is 0.0120998, which is well below the standard threshold of 0.05. A large p-value would suggest that changes in the predictor (vaccinations administered) are not associated with changes in the response (unemployment rate). In this case, however, the small p-value indicates that changes in the predictor (vaccinations administered) are associated with changes in the response variable (unemployment rates).
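The slope, intercept, and R-squared of a simple linear regression can be computed by ordinary least squares as in the sketch below. The function name `fit_line` is hypothetical, and the unemployment values are again illustrative stand-ins, so the output is not the project's reported result. The sketch does not compute a p-value; a library routine such as `scipy.stats.linregress` reports one alongside the slope and intercept.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Vaccinations administered, Dec 2020 to Jun 2021, in tens of millions.
x_axis = [0.373813, 3.1123299, 7.5236003, 15.0273292,
          24.0159677, 29.5891325, 32.6521526]
# Stand-in unemployment rates for the same months (illustration only).
y_axis = [6.7, 6.3, 6.2, 6.0, 6.1, 5.8, 5.9]

slope, intercept, r_squared = fit_line(x_axis, y_axis)
print(f"y = {slope:.4f}x + {intercept:.2f}, R-squared = {r_squared:.2f}")
```

With a single predictor, the R-squared returned here equals the square of the Pearson correlation coefficient for the same data.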

![image.png](attachment:image.png)