In this task we deploy the pandas package for Python to an AWS Lambda layer and use serverless computing to perform exploratory data analysis on COVID-19 data.

#### Launch an EC2 Instance and create our pandas package for python 3.8 runtime
* Go to the EC2 service and launch a Linux instance. I am launching an Ubuntu Server 20.04 instance.
* I am attaching an existing IAM role to this instance, which gives it permission to access the S3 bucket.
<img src='Configure_Instance.png' width=700>
* Choose the default security group and launch the instance.
* SSH into your instance. Our Ubuntu 20.04 instance has Python 3.8 installed by default.
<img src='Python_Version.png' width=700>
* Make a new directory <code>build/python/lib/python3.8/site-packages</code> (for example with <code>mkdir -p</code>) to hold the pandas library for the Python 3.8 runtime.
* Install pip3 on the instance.
<code>sudo apt install python3-pip</code>
<img src='Install_Pip.png' width=700>
* Install pandas into that directory using the <code>-t</code> (target) option of pip3.
<code>pip3 install pandas -t build/python/lib/python3.8/site-packages/</code>
<img src='Install_Pandas.png' width=700>
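* Now package the contents of the build directory into a .zip file to create a pandas package for AWS Lambda. Run the command from inside the build directory so that the python folder sits at the root of the archive.
<code>cd build && zip -r pandas.zip .</code>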
* Next, upload this zip file to our S3 bucket. This requires the AWS CLI, so install it first.
<code>sudo apt-get install awscli</code>
* Now copy the package to the S3 bucket.
<code>aws s3 cp pandas.zip s3://covid-19-tracker-2020/lambda</code>
<img src='Copy_To_S3.png' width=700>
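
Lambda extracts a layer under /opt, and the Python runtime looks for packages in /opt/python, so the archive must have python/ as its top-level folder. As a rough sanity check before uploading, the layout can be inspected with the standard library. This is a minimal sketch; the helper name `layer_layout_ok` is made up for illustration.

```python
import zipfile


def layer_layout_ok(zip_path, required_prefix="python/"):
    """Hypothetical helper: True when every archive entry sits under python/."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    # An empty archive, or one rooted at build/ instead of python/, fails the check.
    return bool(names) and all(name.startswith(required_prefix) for name in names)
```

Calling `layer_layout_ok("pandas.zip")` from the directory that holds the archive should return True when the zip was created from inside the build directory.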

#### Deploy our pandas package to AWS Lambda Layer
* In the AWS Lambda service, go to Layers and click Create layer.
* Choose a name for the layer and upload your pandas.zip file through its S3 link URL. Choose Python 3.8 as the compatible runtime and click Create.
<img src='Create_Layer.png' width=700>
* Create a new Lambda function. Choose Python 3.8 as the runtime, choose the existing execution role which we created, and then click Create function.
<img src='Execution_Role.png' width=700>
* Under Layers > Add a layer, choose your custom layer pandas-layer, choose the most recent version, and click Add.
<img src='Add_Layer.png' width=700>
* In our Lambda function, add the following lines of code to verify that the pandas package is available to the function.
<code>
import pandas as pd
print(pd.__version__)
</code>
<img src='Lambda_Function.png' width=700>
* Create a test event and click Test. The log output shows the pandas version, 1.1.4.
<img src='Test_Event.png' width=700>
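
Lambda invokes a handler function rather than running the file top to bottom, so the version check is usually placed inside the handler. The following is only a sketch of one way to write it: the `module` event key is a hypothetical convenience that defaults to pandas, and the response shape is an assumption, not something the steps above prescribe.

```python
import importlib


def lambda_handler(event, context):
    # "module" is a made-up event key for this sketch; it defaults to pandas,
    # the package shipped in the layer.
    module_name = (event or {}).get("module", "pandas")
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Typical causes: layer not attached, wrong runtime, or wrong zip layout.
        return {"statusCode": 500, "body": f"cannot import {module_name}: {exc}"}
    version = getattr(module, "__version__", "unknown")
    print(f"{module_name} version: {version}")  # shows up in the log output
    return {"statusCode": 200, "body": f"{module_name} {version}"}
```

With an empty test event the handler tries to import pandas. A 500 response points to a problem with the layer rather than with the function code.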

#### Data analysis using pandas library for Python
* Data Ingestion
    * Reading data from a csv source.
    <code>
    data = pd.read_csv('CA__covid19__latest.csv')
    data.head()
    </code>
    <img src='Data_Ingestion.png' width=700>
    * The shape attribute gives the dimensions of our dataframe as a (rows, columns) tuple.
    <code>print('Number of rows: ',data.shape[0],'\nNumber of columns: ',data.shape[1])</code>
    <img src='Data_Ingestion_2.png' width=700>
    <code>data.columns</code>
    <img src='Data_Ingestion_3.png' width=700>
* Data Cleaning
    * The pruid, prnameFR, and percentactive columns do not contribute to our analysis, so we drop them.
    <code>data.drop(columns=['prnameFR','pruid','percentactive'],inplace=True)</code>
    <img src='Data_Cleaning.png' width=700>
    <img src='Data_Cleaning_2.png' width=700>
    * Our data contains a few null values. We will replace those null values with 0.
    <img src='Data_Cleaning_3.png' width=700>
    <code>
    #Fill the NaN values
    data.fillna({
            'numtested':0,
            'numrecover':0,
            'numtestedtoday':0,
            'numrecoveredtoday':0
    }, inplace = True) #fillna only for the columns listed above
    data.head()
    </code>
    <img src='Data_Cleaning_4.png' width=700>
    <img src='Data_Cleaning_5.png' width=700>
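
The ingestion and cleaning steps above can be tried without the real file. The sketch below uses a few made-up rows held in memory. Only the column names come from the dataset, and the values are invented for illustration. It also chains drop and fillna without inplace=True, which is an alternative style rather than the approach used above.

```python
import io

import pandas as pd

# Made-up sample rows; only the column names follow the dataset used above.
SAMPLE_CSV = """pruid,prname,prnameFR,date,numconf,numtoday,numtested,numrecover,percentactive
35,Ontario,Ontario,2020-11-01,100,10,,50,40.0
35,Ontario,Ontario,2020-11-02,125,25,2000,,44.0
24,Quebec,Quebec,2020-11-01,200,20,3000,120,35.0
"""

data = pd.read_csv(io.StringIO(SAMPLE_CSV))
print("Number of rows:", data.shape[0], "Number of columns:", data.shape[1])

# Count missing values per column before deciding how to fill them.
print(data.isna().sum())

# Drop the unused columns, then fill the remaining gaps with 0.
cleaned = (
    data.drop(columns=["prnameFR", "pruid", "percentactive"])
        .fillna({"numtested": 0, "numrecover": 0})
)
print(cleaned.head())
```

Because each call returns a new DataFrame, the original `data` stays unchanged and can be compared with `cleaned`.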

#### Exploratory data analysis
* Find the maximum number of positive COVID-19 cases reported in a single day.
<code>data['numtoday'].max()</code>
<img src='Data_Analysis.png' width=700>
* Display the date on which the number of cases was highest.
<code>data[data['numtoday'] == data['numtoday'].max()]</code>
<img src='Data_Analysis_2.png' width=700>
* Display the dates on which at least 4800 cases were confirmed.
<code>data[data.numtoday>=4800]</code>
<img src='Data_Analysis_3.png' width=700>
* Group the data by province so we can check COVID-19 statistics for each province.
    <code>
    g = data.groupby('prname')
    g
    </code>
    This creates a DataFrameGroupBy object g.
    * To iterate over this object:
    <code>
    for province, province_df in g:
        print(province)
        print(province_df)
    </code>
* The statistics for each group can be visualized by using matplotlib.
<code>
import matplotlib.pyplot as plt
%matplotlib inline
g.plot()
</code>
<img src='Data_Analysis_4.png' width='700'>
* Use a pivot table to show the number of confirmed COVID-19 cases for each province on each date.

<b>Pivot table: A pivot table in pandas is used to transform or reshape data. The levels in the pivot table are stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.</b>

<code>data.pivot(index="date", columns="prname", values=["numconf"])</code>
<img src='Data_Analysis_5.png' width='700'>
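
The filtering, grouping, and pivoting steps above can be tried on a small in-memory frame. This is a sketch with made-up figures. Only the column names follow the dataset, and `idxmax` and `agg` are shown as alternatives rather than as the method used above.

```python
import pandas as pd

# Made-up figures; only the column names follow the dataset used above.
data = pd.DataFrame({
    "prname": ["Ontario", "Ontario", "Quebec", "Quebec"],
    "date": ["2020-11-01", "2020-11-02", "2020-11-01", "2020-11-02"],
    "numtoday": [10, 25, 20, 15],
    "numconf": [100, 125, 200, 215],
})

# idxmax returns the index label of the largest value, so loc fetches that row.
peak_row = data.loc[data["numtoday"].idxmax()]
print(peak_row["date"], peak_row["prname"], peak_row["numtoday"])

# Per-province summary instead of printing every group in full.
summary = data.groupby("prname")["numtoday"].agg(["sum", "max"])
print(summary)

# One row per date and one column per province.
wide = data.pivot(index="date", columns="prname", values="numconf")
print(wide)
```

Passing `values="numconf"` as a string gives flat column labels, while the list form used above adds a numconf level to the columns. `pivot` raises a ValueError when a (date, prname) pair occurs more than once. In that case `pivot_table` with an aggregation function is the usual fallback.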