# Part 1 : Dataframe

* Pandas is an open-source Python library for data analysis. It is a very powerful toolkit for reading, filtering, manipulating, and exporting data.
  https://pandas.pydata.org/
* Since Pandas is not part of the Python standard library, you have to first tell Python to load the library.
* When working with Pandas functions, it is common practice to give pandas the alias pd.

In [None]:
import pandas as pd

* A Pandas DataFrame is a two-dimensional tabular data structure with labeled axes: data is arranged in rows and columns, and both the rows and the columns have labels.

* A DataFrame can be created from **key**-**values** pairs, for example a Python dictionary.
* The **key** represents the column name and the **values** are the contents of the column.
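* As a minimal sketch of the key-values idea, the dictionary below builds a small DataFrame. The column names and numbers are made up for illustration and are not taken from the datasets used later.

```python
import pandas as pd

# Hypothetical example data: each key becomes a column name,
# and each list holds that column's values (all lists have the same length).
data = {
    'country': ['A', 'B', 'C'],
    'cases': [100, 250, 75],
}
df_example = pd.DataFrame(data)

print(df_example)
print(df_example.shape)          # (number of rows, number of columns)
print(list(df_example.columns))  # column names come from the dictionary keys
```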

## 1.1 Loading dataset

* With the pandas library loaded, we can use the read_csv function to load a CSV data file.
* You can also load different types of data like JSON, HTML, EXCEL, SAS, etc.
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
* Let's load data about Covid-19 worldwide from WHO (World Health Organization) https://covid19.who.int/table

In [None]:
#df_covid = pd.read_csv('./who_covid19.csv')

import io
from google.colab import files
uploaded = files.upload()

df_covid = pd.read_csv(io.StringIO(uploaded['who_covid19.csv'].decode('utf-8')))

* A DataFrame is similar to a worksheet in an Excel workbook.

## 1.2 Subsetting columns and rows

* Today's data often has too many cells to make sense of all the printed information. Instead, the best way to look at our data is to inspect it in parts by looking at various subsets of the data.
* We can use the **head** method of a dataframe to look at the first five rows of our data. This is useful to see whether our data loaded properly and to get a sense of each column, its name, and its contents.
* Sometimes, however, we may want to see only particular rows, columns, or values from our data.

In [None]:
df_covid.head()

* If we want only a specific column from our data, we can access the data using square brackets.

In [None]:
country = df_covid['Name']

* To select multiple columns by name, we pass a list of column names between the square brackets.

In [None]:
subset = df_covid[['Name', 'Cases - cumulative total']]

* We can also subset the rows of a dataframe with a boolean condition: only the rows for which the condition is True are kept.

In [None]:
df_covid[df_covid['Cases - cumulative total']>1e7]
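* The sketch below shows what happens inside the square brackets, using a small made-up dataframe (the column names `cases` and `deaths` are assumptions for illustration). The comparison produces a boolean Series, and several conditions can be combined with `&` (and) or `|` (or), each wrapped in parentheses.

```python
import pandas as pd

# Hypothetical data, not the WHO dataset.
df_demo = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'cases': [5, 50, 500, 5000],
    'deaths': [1, 0, 20, 30],
})

# The comparison returns a boolean Series with one True/False per row.
mask = df_demo['cases'] > 10
print(mask)

# Indexing with the mask keeps only the rows where it is True.
large = df_demo[mask]

# Combine conditions with & (and) or | (or); the parentheses are required.
both = df_demo[(df_demo['cases'] > 10) & (df_demo['deaths'] > 10)]
print(both)
```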

* You can insert a new column in the dataframe.

In [None]:
df_covid['death_rate'] = df_covid['Deaths - cumulative total'] / df_covid['Cases - cumulative total']

## 1.3 Describe your data

* describe() shows basic summary statistics (count, mean, standard deviation, minimum, percentiles, and maximum) for the numeric columns of a dataframe.

In [None]:
df_covid.describe()
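* The result of describe() is itself a dataframe, so individual statistics can be looked up by label. A minimal sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical numeric data, used only to show the shape of the output.
df_stats = pd.DataFrame({'cases': [1, 2, 3, 4], 'deaths': [0, 0, 1, 3]})

summary = df_stats.describe()
print(summary)

# Rows are the statistics, columns are the original numeric columns.
print(summary.loc['mean', 'cases'])   # mean of the 'cases' column
print(summary.loc['50%', 'deaths'])   # median of the 'deaths' column
```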

# Part 2 : Data Wrangling

* You will learn how to work with messy data: extract, clean, and deal with invalid or missing values. 
* You will manipulate data using Pandas and other Python packages.

In [None]:
import numpy as np

## 2.1 Missing data

* Missing data is common in most data analysis applications.
* Pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays.

### 2.1.1 Filtering out missing data

* With DataFrame objects, you may want to drop rows or columns which are all NA or just those containing any NAs.
* **dropna** by default drops any row containing a missing value.

In [None]:
#df_BM = pd.read_csv('./bigmart_data.csv')

uploaded = files.upload()
df_BM = pd.read_csv(io.StringIO(uploaded['bigmart_data.csv'].decode('utf-8')))

In [None]:
# install missingno first; the import below fails if the package is not installed
!pip install missingno

In [None]:
import missingno as msno

In [None]:
msno.matrix(df_BM)

In [None]:
msno.bar(df_BM)

In [None]:
df_BM = df_BM.dropna(how='any')

In [None]:
msno.matrix(df_BM)
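* The options of **dropna** can be compared on a small made-up dataframe (the column names below are assumptions for illustration, not the BigMart columns). This is a sketch of the main variants: drop rows with any NaN, drop rows that are all NaN, drop columns that are all NaN, and check only selected columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
df_na = pd.DataFrame({
    'item': ['x', 'y', 'z'],
    'weight': [1.5, np.nan, 3.0],
    'outlet_size': [np.nan, np.nan, np.nan],
})

# Default behaviour: drop every row that contains at least one NaN.
rows_any = df_na.dropna(how='any')

# Drop only rows in which every value is NaN.
rows_all = df_na.dropna(how='all')

# axis=1 works on columns: drop columns that are entirely NaN.
cols_all = df_na.dropna(axis=1, how='all')

# subset restricts the check to the listed columns.
rows_subset = df_na.dropna(subset=['weight'])

print(len(rows_any), len(rows_all), list(cols_all.columns), list(rows_subset['item']))
```

* Note that the default can remove far more rows than expected when a single column is mostly empty, so it is worth checking the result's length.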

### 2.1.2 Filling in missing data

* Rather than filtering out missing data, you may want to fill in the “holes” in any number of ways.
* Calling fillna with a constant replaces missing values with that value.

In [None]:
df_BM = pd.read_csv('./bigmart_data.csv')

In [None]:
df_BM.head(10)

In [None]:
df_BM.fillna(0).head(10)

In [None]:
msno.matrix(df_BM)
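* **fillna** returns a new dataframe and leaves the original unchanged unless the result is assigned back. A constant is also not the only choice: passing a dictionary fills each column with its own value. The sketch below uses made-up columns and fills a numeric column with its mean, which is one common choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df_fill = pd.DataFrame({
    'weight': [1.0, np.nan, 3.0],
    'size': ['Small', None, 'High'],
})

# Fill each column with its own replacement value.
filled = df_fill.fillna({
    'weight': df_fill['weight'].mean(),  # mean of the non-missing values
    'size': 'Unknown',
})

print(filled)
# The original still contains its missing values.
print(df_fill['weight'].isna().sum())
```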

## 2.2 Merging data

* Merge or join operations combine data sets by linking rows using one or more keys.

In [None]:
df1 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df2 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
# from Python Data Science Handbook: https://www.oreilly.com/library/view/python-data-science/9781491912126/

In [None]:
pd.merge(df1, df2, on='name')

* You probably noticed that only 'Mary' appears in the result: 'Peter', 'Paul', and 'Joseph' are missing because their names do not occur in both dataframes. By default merge does an **inner** join.
* Other possible options are **left**, **right**, and **outer**.
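* The join type is chosen with the `how` argument of `pd.merge`. The sketch below repeats the two small dataframes so that it runs on its own and compares the number of rows each join type returns.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']})
df2 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']})

inner = pd.merge(df1, df2, on='name')               # only names present in both
left = pd.merge(df1, df2, on='name', how='left')    # every row of df1
right = pd.merge(df1, df2, on='name', how='right')  # every row of df2
outer = pd.merge(df1, df2, on='name', how='outer')  # every name from either side

# Cells with no match on the other side are filled with NaN.
print(outer)
print(len(inner), len(left), len(right), len(outer))
```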

## 2.3 Sort values

* A Pandas dataframe has a useful sorting method:

    * **sort_values()**: sorts a dataframe by one or more columns

* It comes with numerous options, such as the sort order (ascending or descending), sorting in place, where to put missing values, and which sorting algorithm to use.

In [None]:
df_covid = pd.read_csv('./who_covid19.csv')

In [None]:
df_covid['death_rate'] = df_covid['Deaths - cumulative total']/df_covid['Cases - cumulative total']

In [None]:
df_covid.head()

* Suppose you want to sort the dataframe by "death_rate". You can use **sort_values** for this.

In [None]:
df_covid.sort_values(by='death_rate')

* `ascending`: the default sorting order is ascending. Passing False sorts in descending order.

In [None]:
df_covid.sort_values(by='death_rate', ascending=False)
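* Two of the other options mentioned above can be sketched on made-up data (the column names `region` and `rate` are assumptions for illustration): sorting by several columns with a different order for each, and choosing where missing values go with `na_position`.

```python
import numpy as np
import pandas as pd

# Hypothetical data, not the WHO dataset.
df_sort = pd.DataFrame({
    'region': ['B', 'A', 'A', 'B'],
    'rate': [0.02, 0.03, 0.01, 0.05],
})

# Sort by region (ascending), then by rate (descending) within each region.
by_two = df_sort.sort_values(by=['region', 'rate'], ascending=[True, False])
print(by_two)

# Missing values go last by default; na_position='first' puts them on top.
with_nan = pd.DataFrame({'rate': [0.02, np.nan, 0.01]})
nan_first = with_nan.sort_values(by='rate', na_position='first')
print(nan_first)
```

* sort_values returns a new dataframe; the original row order is kept unless the result is assigned back.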

## 2.4 Aggregating

#### **What is the mean number of cases for each region?** : groupby
- In the given data set, you may want to find out **what is the mean number of cases for each region**?
- You can use **groupby()** to achieve this.
- The first step would be to group the data by 'WHO Region' column.

In [None]:
df_covid.head()

In [None]:
# numeric_only=True skips text columns such as 'Name', which cannot be averaged
grouped = df_covid.groupby('WHO Region').mean(numeric_only=True)
grouped
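- groupby() can also compute several statistics at once with **agg()**. A minimal sketch on made-up data (the column names `region` and `cases` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data, not the WHO dataset.
df_g = pd.DataFrame({
    'region': ['Europe', 'Europe', 'Africa'],
    'cases': [100, 300, 50],
    'deaths': [10, 20, 5],
})

# Select one column after grouping, then ask for several aggregations.
summary_by_region = df_g.groupby('region')['cases'].agg(['mean', 'sum', 'count'])
print(summary_by_region)
```

- The result has one row per group and one column per aggregation.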

# Part 3 : Data Visualization

* Making plots and static or interactive visualizations is one of the most important tasks in data analysis. It may be a part of the exploratory process; for example, helping identify outliers, needed data transformations, or coming up with ideas for models.

## 3.1 Matplotlib 

* Matplotlib is one of the most widely used Python libraries for data visualization because of its flexibility and extensive functionality.

In [None]:
# importing matplotlib
import matplotlib.pyplot as plt

# display plots in the notebook itself
%matplotlib inline

### 3.1.1 Make a simple plot

* Let's create a basic plot to start working with!

In [None]:
height = [150,160,165,185]
weight = [70, 80, 90, 100]

In [None]:
# draw the plot

plt.plot(height, weight)
plt.show()

* We pass two lists as input arguments to the **plot()** function. Note that the first list appears on the x-axis and the second list appears on the y-axis of the plot.

### 3.1.2 Title, Labels, and Legends
- Now that our first plot is ready, let us add the title, and name x-axis and y-axis using methods title(), xlabel() and ylabel() respectively.


In [None]:
plt.title("Relation between height and weight")

plt.xlabel("height")
plt.ylabel("weight")

plt.plot(height, weight)
plt.show()

### 3.1.3 Export figures

In [None]:
plt.title("Relation between height and weight")

plt.xlabel("height")
plt.ylabel("weight")

plt.plot(height, weight)
plt.savefig('./figure.pdf')
#files.download('./figure.pdf') 
plt.show()

## 3.2 Seaborn

* Seaborn (https://seaborn.pydata.org/) is a Python data visualization library based on matplotlib.
* It provides a high-level interface for drawing attractive and informative statistical graphics. It provides choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrames.
* The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.

In [None]:
import seaborn as sns

### 3.2.1 Line Chart

- With some datasets, you may want to understand changes in one variable as a function of time, or a similarly continuous variable.
- Let's visualize historical data (up to 14 December 2020) on the daily number of newly reported COVID-19 cases and deaths worldwide (source: https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide)

In [None]:
#df_covid = pd.read_csv('./covid19-worldwide2020.csv')

uploaded = files.upload()
df_covid = pd.read_csv(io.StringIO(uploaded['covid19-worldwide2020.csv'].decode('utf-8')))

df_covid.head()

* In seaborn, this can be accomplished by the **lineplot()** function: https://seaborn.pydata.org/generated/seaborn.lineplot.html
* How does the number of 'cases' change monthly?

In [None]:
cases_by_month = df_covid.groupby('month')['cases'].sum()

In [None]:
cases_by_month = cases_by_month.reset_index()

In [None]:
sns.lineplot(data=cases_by_month, x='month', y='cases')
plt.show()
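* The reset_index() step above is needed because groupby followed by sum() returns a Series whose index holds the group labels, while `lineplot(data=..., x=..., y=...)` looks up `x` and `y` by column name. A minimal sketch with made-up numbers shows the difference:

```python
import pandas as pd

# Hypothetical daily records, not the ECDC dataset.
df_daily = pd.DataFrame({
    'month': [1, 1, 2, 2, 2],
    'cases': [10, 20, 5, 5, 40],
})

# The result is a Series: 'month' is the index, not a column.
monthly = df_daily.groupby('month')['cases'].sum()
print(type(monthly).__name__, monthly.index.name)

# reset_index() turns the index into a regular column again.
monthly_df = monthly.reset_index()
print(monthly_df)
```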

### 3.2.2 Bar Chart

* Suppose we want to look at **what is the total number of cases for each continent?**
* A bar chart is another simple type of visualization that is used for categorical variables.
* In seaborn, you can create a barchart by simply using the **barplot** function: https://seaborn.pydata.org/generated/seaborn.barplot.html

In [None]:
cases_by_continent = df_covid.groupby('continentExp')['cases'].sum()
cases_by_continent

In [None]:
cases_by_continent = cases_by_continent.reset_index()
cases_by_continent

In [None]:
sns.barplot(data=cases_by_continent, x='continentExp', y='cases')
plt.show()

### 3.2.3 Histogram

- **Distribution of cases**
- Histograms are a very common type of plot for data that is continuous in nature, such as height and weight, stock prices, or the waiting time for a customer.
- A histogram divides the range of the data into bins and plots the number of observations (the frequency) that fall into each bin.
- Histograms occur very commonly in probability and statistics and are used to visualize distributions such as the normal distribution and the t-distribution.
- You can create a histogram in seaborn by using **displot()**. It has multiple options, some of which we will see later in the notebook.
- https://seaborn.pydata.org/generated/seaborn.displot.html

In [None]:
cases_by_country = df_covid.groupby('countriesAndTerritories')['cases'].sum()
cases_by_country

In [None]:
x = cases_by_country.values

sns.displot(x)
plt.show()

In [None]:
y = cases_by_country.values

sns.displot(y, kind='kde')
plt.show()
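- To make the idea of bins and frequencies concrete, the sketch below counts made-up values into four equal-width bins with NumPy. It only illustrates the counting behind a histogram; the number of bins that displot() picks by default is chosen differently.

```python
import numpy as np

# Hypothetical values, not the COVID-19 data.
values = np.array([1, 2, 2, 3, 3, 3, 8, 9])

# Split the range [1, 9] into 4 equal-width bins and count values per bin.
counts, edges = np.histogram(values, bins=4)

print(edges)   # bin boundaries
print(counts)  # number of values in each bin
```

- Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.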

### 3.2.4 Box Plots

- **Distribution of cases**
- A box plot shows the three quartile values of the distribution along with extreme values (https://en.wikipedia.org/wiki/Box_plot)
- The “whiskers” extend to the most extreme points that lie within 1.5 IQRs (inter-quartile ranges) of the lower and upper quartiles; observations that fall outside this range are displayed independently.
- This means that each value in the boxplot corresponds to an actual observation in the data.
- You can use the **boxplot()** for creating boxplots in seaborn: https://seaborn.pydata.org/generated/seaborn.boxplot.html

In [None]:
y = cases_by_country.values

sns.boxplot(y=y)
plt.show()

In [None]:
g = sns.boxplot(y=y)
g.set(yscale="log")
plt.show()
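- The quantities a box plot draws can be computed by hand. The sketch below does so with NumPy on made-up numbers, using NumPy's default (linear interpolation) percentile definition; treat it as an illustration of the 1.5 IQR rule rather than an exact reproduction of seaborn's internals.

```python
import numpy as np

# Hypothetical values with one obvious extreme point.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# The three quartiles define the box and the line inside it.
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = data[(data >= lower_fence) & (data <= upper_fence)]
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Whiskers end at the most extreme observations inside the fences.
whisker_low, whisker_high = inside.min(), inside.max()

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```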

### 3.2.5 Scatter Plots

- **Relative distribution of total cases and deaths**
- A scatter plot depicts the joint distribution of two variables using a cloud of points, where each point represents an observation in the dataset.
- This depiction allows the eye to infer a substantial amount of information about whether there is any meaningful relationship between them.
- You can use **relplot()** with the option `kind='scatter'` to draw a scatter plot in seaborn: https://seaborn.pydata.org/generated/seaborn.relplot.html

In [None]:
cases_and_death3 = df_covid.groupby('countriesAndTerritories')[['cases', 'deaths', 'popData2019', 'continentExp']].agg(
    {'cases':'sum', 'deaths':'sum', 'popData2019':'max', 'continentExp':'max'})

In [None]:
cases_and_death3

In [None]:
cases_and_death3 = cases_and_death3.reset_index(drop=True)
cases_and_death3

In [None]:
sns.relplot(x='cases', y='deaths', data=cases_and_death3, kind='scatter')
plt.show()

### 3.2.6 Hue semantic

* We can also add another dimension to the plot by coloring the points according to a third variable. In seaborn, this is referred to as using a “hue semantic”.

In [None]:
sns.relplot(x='cases', y='deaths', data=cases_and_death3, kind='scatter', hue='continentExp', s=100, alpha=0.5)
plt.show()

### 3.2.7 Heat map

* You can visualize the correlations between the numeric columns with a heatmap in seaborn.

In [None]:
# numeric_only=True leaves out the text column 'continentExp'
cases_and_death3.corr(numeric_only=True)

In [None]:
sns.heatmap(cases_and_death3.corr(numeric_only=True), annot=True)
plt.show()
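* The matrix that the heatmap colors is a table of pairwise correlation coefficients. By default corr() computes the Pearson coefficient. The sketch below uses made-up numbers to show the structure of that table and checks one entry against NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one text column.
df_c = pd.DataFrame({
    'cases': [10, 20, 30, 40],
    'deaths': [1, 2, 3, 5],
    'label': ['a', 'b', 'c', 'd'],
})

# Only numeric columns take part; the result is a square, symmetric table.
corr = df_c.corr(numeric_only=True)
print(corr)

# The same Pearson coefficient computed with NumPy for comparison.
r = np.corrcoef(df_c['cases'], df_c['deaths'])[0, 1]
print(r)
```

* Every diagonal entry is 1 because each column is perfectly correlated with itself.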