# Data analysis and presentation

## Preparation for analysis

Every time you want to explore and extract information from a dataset, you first need to understand what kind of information the available data can provide. In general, data is classified as:

**Numerical data:** Also known as quantitative data, it represents counts or measurements, such as age, height or weight. With this type of data, we can perform statistical analysis and compute the mean, median, standard deviation, etc. This type can be subdivided into two groups:

*   **Discrete:** Represented by integer numbers (ex: Age).
*   **Continuous:** Can assume any real value (ex: Weight, height).

**Categorical data:** Also known as qualitative data, it represents non-numerical characteristics. It can be:

*   **Ordinal:** Data that can be ordered in some way that makes sense (ex: Age range, stages of a disease, dates).
*   **Nominal:** Data defined by names, with no specific order (ex: Blood type, race, sex, yes/no).
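As a minimal sketch of how these four types can be represented in Pandas, the cell below builds a small table. The `patients` table and its column names are hypothetical and exist only for illustration. The nominal column becomes an unordered category, while the ordinal column receives an explicit order through `pd.CategoricalDtype`.

```python
import pandas as pd

# Hypothetical toy records, used only to illustrate the four data types.
patients = pd.DataFrame({
    "age": [34, 51, 27],                  # numerical, discrete
    "weight_kg": [70.5, 82.3, 58.9],      # numerical, continuous
    "blood_type": ["A", "O", "AB"],       # categorical, nominal
    "disease_stage": ["II", "I", "III"],  # categorical, ordinal
})

# Nominal data: categories without any order.
patients["blood_type"] = patients["blood_type"].astype("category")

# Ordinal data: categories with an explicit order.
stage_dtype = pd.CategoricalDtype(categories=["I", "II", "III"], ordered=True)
patients["disease_stage"] = patients["disease_stage"].astype(stage_dtype)

print(patients.dtypes)

# Ordered categories support order-based operations such as max().
most_advanced_stage = patients["disease_stage"].max()
print(most_advanced_stage)
```

Declaring the order explicitly matters because Pandas would otherwise treat the stages as plain labels, and order-based operations would not be meaningful.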


> Check [this video](https://www.youtube.com/watch?v=GlgA8OMgLxE) for a nice explanation of these types.


### Invalid or missing data

Every time a dataset is collected, a number of procedures must be carried out before we can extract any relevant and reliable information. In the previous notebooks, we've seen how to start the data exploration with Pandas. However, once we have created a dataframe, we need to check the integrity of the data and clean it before we can proceed to analysis. According to [IBM Data Analytics](https://www.ibm.com/cloud/blog/ibm-data-catalog-data-scientists-productivity), 80% of the time spent on a dataset is spent cleaning up the data.

Treating missing or invalid data is an important stage of data cleaning (if some data can't be used in the analysis, it is effectively missing). We're going to use a small [dataset](https://raw.githubusercontent.com/dataoptimal/posts/master/data%20cleaning%20with%20python%20and%20pandas/property%20data.csv) that is big enough to show how to deal with missing data. Run the cells below to import the example data.

In [0]:
import pandas as pd

In [0]:
missing_data = pd.read_csv('https://raw.githubusercontent.com/dataoptimal/posts/master/data%20cleaning%20with%20python%20and%20pandas/property%20data.csv',sep=',')
missing_data

Notice the invalid data in the `DataFrame` above, some of which Pandas can detect and label as `NaN`. We can use the `isna()` method to identify missing values in a series.

In [0]:
missing_data['NUM_BATH'].isna()

See that the `isna()` method returns `True` for missing values in a given `Series`. However, it's not practical to apply this method manually to every feature. Instead, we can apply `isna()` to the whole dataframe and chain the `sum()` method to count the missing values per feature:


In [0]:
missing_data.isna().sum()

Pandas isn't always able to identify invalid data. In our example, the `NUM_BEDROOMS` feature contains an `'na'` value, whereas the `SQ_FT` feature contains a `'--'` value. In this case, we can use the `unique()` or `value_counts()` methods to inspect the values of a series:

In [0]:
missing_data["NUM_BEDROOMS"].unique()

In [0]:
missing_data["SQ_FT"].value_counts()
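Once such placeholders are known, one possible way to handle them is to declare them when the file is read. The sketch below assumes a tiny inline CSV (the `csv_text` content is made up and only mimics the `'na'` and `'--'` placeholders) and uses the `na_values` parameter of `read_csv()` so that Pandas converts them to `NaN`.

```python
import io
import pandas as pd

# Hypothetical CSV content that mimics the placeholders found above.
csv_text = "NUM_BEDROOMS,SQ_FT\n3,1000\nna,--\n2,850\n"

# Default parsing keeps 'na' and '--' as ordinary strings.
raw = pd.read_csv(io.StringIO(csv_text))
missing_before = int(raw.isna().sum().sum())

# Declaring the placeholders lets Pandas convert them to NaN.
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])
missing_after = int(cleaned.isna().sum().sum())

print("Missing values before:", missing_before)
print("Missing values after: ", missing_after)
print(cleaned.dtypes)
```

A side effect worth noticing is that, with the placeholders gone, both columns are parsed as numbers instead of text.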

Another source of invalid data is data that does not respect the **domain** of the feature. Feature `OWN_OCCUPIED` should contain values in the format `Y` or `N`, but we see the value `12` for a given sample. In this case, we can use the `isin()` and `all()` methods to see if all values in a series respect its domain:

In [0]:
missing_data["OWN_OCCUPIED"].unique()

* The `isin()` method returns `True` for each value in a series that respects a given domain, and `False` otherwise:

In [0]:
domain_condition = missing_data["OWN_OCCUPIED"].isin(["Y","N"])
domain_condition.head()

* The `all()` method evaluates whether all values in the new series equal `True`.

In [0]:
domain_condition.all()

To identify which rows violate the domain of `"OWN_OCCUPIED"`, we can invert the condition using the `~` (not) operator:

In [0]:
missing_data[~domain_condition]
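After the violations are located, a common next step is to mark them as missing so that they are handled together with the other `NaN` values. The sketch below is one possible approach, using a made-up `own_occupied` series rather than the real dataset: the `where()` method keeps the values that satisfy the condition and replaces the rest with `NaN`.

```python
import pandas as pd

# Hypothetical series that mimics the OWN_OCCUPIED feature.
own_occupied = pd.Series(["Y", "N", "12", "Y", None], name="OWN_OCCUPIED")

# True where the value belongs to the domain, False otherwise.
valid = own_occupied.isin(["Y", "N"])

# Keep valid values and replace everything else with NaN.
cleaned = own_occupied.where(valid)

print(cleaned)
print("Missing after cleaning:", cleaned.isna().sum())
```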

## Starting the analysis

For this part of the tutorial, we are going to load the data from a URL using Pandas. We can do this by passing the URL as an argument to the `read_csv()` function:

In [0]:
url_data = 'http://bit.ly/2cLzoxH'
data = pd.read_csv(url_data)
data.head(n=10)

In [0]:
# Take a quick look at the data types
# data.dtypes

In [0]:
# data["continent"].value_counts()

For this dataset, we have no missing or invalid data, so we can start our analysis. The first set of tools that we can use comes from **descriptive statistics**. Pandas offers the main **central** and **dispersion** measures, which we can apply to any numeric data series.

### Central measures

![alt text](https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTor8kh4Rw5xvOcKII09VV1WANhpsjp_WIB6-oQv4IpjfUWZTTn)

* **Mean**: The sum of all values, divided by the number of samples in the dataset.

In [0]:
# numeric_only=True skips text columns such as country and continent
data.mean(numeric_only=True)

* **Median**: The value that separates the upper half from the lower half of the dataset.

In [0]:
data["year"].median()

* **Mode**: The value(s) that appear most frequently in the dataset.

In [0]:
data["year"].mode()
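To make the three definitions concrete, the following sketch recomputes them by hand on a small made-up series and compares the results with the Pandas methods. The values are hypothetical and include one extreme value (10) to show how the mean is pulled toward it while the median and the mode are not.

```python
import pandas as pd

# Hypothetical sample with one extreme value (10).
values = pd.Series([1, 2, 2, 3, 10])

# Mean: sum of all values divided by the number of samples.
mean_by_hand = values.sum() / len(values)

# Median: middle element of the sorted values (the length is odd here).
sorted_values = values.sort_values().reset_index(drop=True)
median_by_hand = sorted_values[len(sorted_values) // 2]

# Mode: value with the highest count.
mode_by_hand = values.value_counts().idxmax()

print("Mean:  ", mean_by_hand, values.mean())
print("Median:", median_by_hand, values.median())
print("Mode:  ", mode_by_hand, values.mode().tolist())
```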

### Dispersion Measures

**Variance**: Indicates how scattered the data in a series is. It is the mean of the squared distances between each value in the series and the mean of the series. Each distance is squared so that positive and negative distances do not cancel each other. Due to squaring, the scale of the variance does not match the scale of the series. Overall, a low variance indicates that the values tend to be close to the mean, while a high variance indicates that the values are scattered.

In [0]:
data["year"].var()

**Standard deviation**: Square root of the variance. It carries the same information, but is expressed in the same scale as the series:

In [0]:
data["year"].std()
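One detail that is easy to miss is that `var()` and `std()` in Pandas compute the *sample* versions by default (`ddof=1`, dividing by n - 1), not the population versions (dividing by n). The sketch below uses a made-up series to compute both variants by hand and compare them with Pandas.

```python
import pandas as pd

# Hypothetical sample whose mean is exactly 5.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Squared distance from each value to the mean of the series.
squared_distances = (values - values.mean()) ** 2

population_variance = squared_distances.sum() / len(values)    # divide by n
sample_variance = squared_distances.sum() / (len(values) - 1)  # divide by n - 1

print("Population variance:", population_variance, values.var(ddof=0))
print("Sample variance:    ", sample_variance, values.var())
print("Population std:     ", population_variance ** 0.5, values.std(ddof=0))
```

For large datasets the two variants are nearly identical, but for small samples the difference can be noticeable.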

**Quantiles**: Partition the ordered values in a series. A 25% quantile indicates that 25% of the series values are less than that quantile. By convention, ***quartiles*** are the 25%, 50% and 75% quantiles, also known as first, second and third quartiles:


In [0]:
data["year"].quantile(0.25)

In [0]:
first_quartile = data.query(f"year < {data['year'].quantile(0.25)}")
first_quartile.shape

In [0]:
data.shape

In [0]:
# Testing for other quartiles
# data["year"].quantile(0.5)

In [0]:
# second_quartile = data.query(f"year < {data['year'].quantile(0.5)}")
# second_quartile.shape
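The same check can be sketched on a made-up series where the answer is easy to verify by eye. The values 1 to 100 below are hypothetical. Note that `quantile()` interpolates linearly between neighboring values by default, so the result does not have to be a value that appears in the series.

```python
import pandas as pd

# Hypothetical series with the integers from 1 to 100.
values = pd.Series(range(1, 101))

# First quartile, interpolated between the neighboring values 25 and 26.
first_quartile_value = values.quantile(0.25)

# Share of values strictly below the first quartile.
share_below = (values < first_quartile_value).mean()

# Passing a list returns all three quartiles at once.
quartiles = values.quantile([0.25, 0.5, 0.75])

print("First quartile:", first_quartile_value)
print("Share below:   ", share_below)
print(quartiles)
```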

### Other descriptive statistics

* `describe()`: Brings together many descriptive measures about the data, including the results of the `count()`, `min()` and `max()` methods:

In [0]:
data["year"].describe()

In [0]:
data.describe()

* `nunique()`: Returns the number of distinct values.

In [0]:
data.nunique()

In [0]:
data["year"].nunique()

* `sort_values()`: Sorts the values of a `DataFrame` or `Series`, in ascending or descending order. When using the `sort_values()` method of a `DataFrame`, we can specify multiple columns to sort by. In this case, ties on the first column are broken using the second column, and so on.

In [0]:
data["year"].sort_values().head()

In [0]:
data.sort_values(by=['year','country'], ascending=False).head()

In [0]:
# data.sort_values(by=['year','lifeExp'], ascending=True).head()
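The `ascending` parameter also accepts one flag per column, which is useful when each column should be sorted in a different direction. The sketch below uses a small made-up table (the rows are hypothetical) to sort by year in descending order and break ties by country in ascending order.

```python
import pandas as pd

# Hypothetical rows, only to illustrate tie-breaking.
records = pd.DataFrame({
    "year": [2007, 1952, 2007, 1952],
    "country": ["Brazil", "Chile", "Angola", "Angola"],
})

# Most recent years first; within the same year, countries from A to Z.
ordered = records.sort_values(by=["year", "country"], ascending=[False, True])

print(ordered)
```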

## Data Presentation

The analysis of central and dispersion measures is usually complemented by data visualization. To start, let's load the necessary libraries:
- `matplotlib` is a library that serves exclusively to create graphics.
- `seaborn` is a library designed to create statistical graphs in Python. It's built on Matplotlib and it's integrated with Pandas data structures.

By convention, we only load the `pyplot` module from the `matplotlib` library and call it `plt`. In the case of `seaborn`, we load the entire library, call it `sns` and use its `set()` method to put its initial settings into effect.

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

### Histograms

It's quite easy to construct a histogram with Pandas and matplotlib. However, we need to understand exactly what is being built. In the example below, we'll use the `lifeExp` feature, which records the life expectancy of each country per year. Calling `hist(bins=100)` produces a histogram with 100 different value ranges (bins).

In [0]:
data['lifeExp'].hist(bins=100)

Below, we can see the (extreme) effect of building a histogram with few value ranges (only two, in this case).



In [0]:
data['lifeExp'].hist(bins=2)

The next case is the opposite of the above: many value ranges (1000) make understanding the plot very difficult.

In [0]:
data['lifeExp'].hist(bins=1000)

In [0]:
# Plotting the feature 'year' shows that not all information can be analyzed using a histogram
#data["year"].hist(bins=12)

In [0]:
#data["year"].value_counts()
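To see numerically what changes with the number of bins, the sketch below counts values per bin with `numpy.histogram` instead of drawing the plot. The data is synthetic (random values generated with a fixed seed), and `summarize_bins` is a hypothetical helper written just for this illustration.

```python
import numpy as np

# Synthetic data: 1000 values drawn from a normal distribution (fixed seed).
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=60, scale=10, size=1000)


def summarize_bins(data, bins):
    """Return bin width, largest bin count and number of empty bins."""
    counts, edges = np.histogram(data, bins=bins)
    return {
        "bins": bins,
        "width": float(edges[1] - edges[0]),
        "largest_bin": int(counts.max()),
        "empty_bins": int((counts == 0).sum()),
        "total": int(counts.sum()),
    }


for bins in (2, 20, 1000):
    print(summarize_bins(values, bins))
```

With very few bins almost all detail is lost, while with too many bins most bins hold only a handful of values (or none), so the shape of the distribution is hidden by noise.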

The standard matplotlib histogram is basic and only serves for a quick look at the data distribution. Note that there are no names on the x and y axes, and that an x-axis region is presented even if there is no data in it.

We can customize the histogram using the following parameters and `pyplot` functions:
  - `xlabelsize` and `ylabelsize` (parameters of `hist()`) set the font size of the tick labels on each axis;
  - `plt.xlabel()` and `plt.ylabel()` set the title of each axis and the size of that text;
  - `plt.xlim()` determines the lower and upper limits of the horizontal axis.

In [0]:
data['lifeExp'].hist(bins=100, grid=False, xlabelsize=12, ylabelsize=12)
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.title("Life expectancy distribution", fontsize=17)
plt.xlim([22.0,90.0])
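The same customization can also be written with matplotlib's object-oriented interface, which makes it explicit which axes are being changed. This is only an alternative sketch: the `life_exp` values are made up, and the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical life expectancy values, only for illustration.
life_exp = pd.Series([43.8, 52.5, 61.2, 68.9, 71.3, 74.0, 77.6, 80.2], name="lifeExp")

# Create the figure and axes explicitly, then draw the histogram on them.
fig, ax = plt.subplots(figsize=(8, 4))
life_exp.hist(bins=4, grid=False, xlabelsize=12, ylabelsize=12, ax=ax)

# Axis titles, plot title and horizontal limits are set on the axes object.
ax.set_xlabel("Life expectancy", fontsize=15)
ax.set_ylabel("Frequency", fontsize=15)
ax.set_title("Life expectancy distribution", fontsize=17)
ax.set_xlim(22.0, 90.0)
```

Keeping a reference to `ax` is convenient when a figure has more than one plot, since each plot can then be customized independently.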

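Although it is convenient to call the `hist()` method directly on a series, the `seaborn` `distplot()` function is more powerful. In addition to presenting a histogram of the data, `distplot()` estimates the **probability distribution** of the data using a kernel density estimate. Note that `distplot()` has been deprecated since seaborn 0.11 in favor of `histplot()` and `displot()`, so recent versions emit a warning when it is called: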

In [0]:
sns.distplot(data["lifeExp"])
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Density", fontsize=15)
plt.title("Life expectancy distribution", fontsize=17)

In [0]:
# data["lifeExp"].describe()

In [0]:
# print("Median: ", data["lifeExp"].median())
# print("Mode: ", data["lifeExp"].mode())
# print("Variance: ", data["lifeExp"].var())

The probability distribution estimated in the graph above is an important source of information about the data. We can compare it with a **normal distribution** using the `norm` object from the `scipy.stats` module:

In [0]:
from scipy.stats import norm

In [0]:
sns.distplot(data["lifeExp"], fit=norm)
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Density", fontsize=15)
plt.title("Life expectancy distribution", fontsize=17)
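To understand where the fitted curve comes from, it helps to call `norm.fit()` directly. As far as the `fit` parameter is documented, `distplot()` relies on the `fit()` and `pdf()` methods of the object it receives. The sketch below uses made-up values and shows that, for a normal distribution, the fitted parameters are simply the mean and the population standard deviation of the data.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical life expectancy values, only for illustration.
values = np.array([43.8, 52.5, 61.2, 68.9, 71.3, 74.0, 77.6, 80.2])

# Maximum likelihood estimates of the normal parameters.
mu, sigma = norm.fit(values)

# Evaluate the fitted density on a grid, as a plot of the curve would.
grid = np.linspace(values.min(), values.max(), 5)
density = norm.pdf(grid, loc=mu, scale=sigma)

print("Fitted mean:", mu)
print("Fitted std: ", sigma)
print("Density on grid:", density)
```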

In [0]:
# Do the same analysis for the feature "gdpPercap"

# data['gdpPercap'].hist(bins=50)
# plt.xlabel("GDP per capita", fontsize=15)
# plt.ylabel("Frequency",fontsize=15)
# plt.title("GDP per capita distribution", fontsize=17)

In [0]:
# sns.distplot(data["gdpPercap"], fit=norm)

In this case, we see that the actual distribution of the data differs greatly from the normal distribution. In fact, it looks more like a **bimodal distribution**, which usually occurs when the data combines subsets that are each approximately normally distributed around a different mean.

The following code cell plots life expectancy in Africa and Europe on the same axes, showing where the bimodal distribution of the graph above comes from:

In [0]:
africa_data = data.query("continent == 'Africa'")
europe_data = data.query("continent == 'Europe'")

sns.distplot(europe_data["lifeExp"], fit=norm)
sns.distplot(africa_data["lifeExp"], fit=norm)
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Density", fontsize=15)
plt.title("Life expectancy distribution in the African and European continents", fontsize=17)
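The mechanism behind a bimodal shape can be reproduced with synthetic data. In the sketch below, the two groups, their means (50 and 75) and their spread are arbitrary choices made for illustration and are not estimates from the dataset. Each group is normally distributed, but the combined data shows two peaks separated by a valley.

```python
import numpy as np

# Two synthetic, normally distributed groups with different means (fixed seed).
rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=50, scale=4, size=2000)
group_b = rng.normal(loc=75, scale=4, size=2000)
combined = np.concatenate([group_a, group_b])

# 13 bins of width 5 between 30 and 95.
counts, edges = np.histogram(combined, bins=13, range=(30, 95))

# A simple text histogram: one '#' for every 40 values in the bin.
for left, count in zip(edges[:-1], counts):
    print(f"{left:5.1f} | {'#' * (count // 40)} {count}")
```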

In addition to being interesting from a statistical point of view, the graph above is socially striking and worrying, given such a difference between the distributions.

### Boxplots and violin plots

The `boxplot()` and `violinplot()` methods of `seaborn` produce other types of graphs useful for distribution analysis:

**Boxplot**: shows the quartiles of a series, represented by a box. The box edges are the first and third quartiles, while the line inside the box is the second quartile. This type of graph is also known as a box-and-whiskers plot, because the minimum and maximum elements are represented by the "whiskers" of the box.

A particularity of this graph is that the whisker limits are calculated from the distance between the first and third quartiles (the interquartile range, IQR). By default, `seaborn` extends each whisker to the most extreme value lying within 1.5 times the IQR from the box edges. Values beyond the whiskers are considered outliers and appear in the boxplot as dots.
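The whisker rule can be sketched by hand with Pandas. The series below is made up, and the factor 1.5 is the usual default for boxplots. The plotting library may compute quartiles slightly differently, so treat this as an approximation of what the plot shows rather than its exact implementation.

```python
import pandas as pd

# Hypothetical series with one clearly extreme value (40).
values = pd.Series([2, 3, 4, 5, 6, 7, 8, 9, 10, 40])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # interquartile range: height of the box

# Limits beyond which values are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

outliers = values[(values < lower_fence) | (values > upper_fence)]

print("Box:", q1, q3, "IQR:", iqr)
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers.tolist())
```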

In [0]:
sns.boxplot(x="lifeExp", y="continent", data=data.sort_values("continent"))
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Continent",fontsize=15)
plt.title("Life expectancy per continent", fontsize=17)

As we can see, Africa is the continent with the lowest life expectancy in general, while Asia is the continent where this data is most scattered. In the chart above, you can also see that there are many outliers. We can investigate further by filtering the data by year. The code below produces a boxplot of life expectancy for the year 2007:

In [0]:
data_2007 = data.query("year == 2007")
sns.boxplot(x="lifeExp", y="continent", data=data_2007.sort_values("continent"))
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Continent",fontsize=15)
plt.title("Life expectancy per continent (2007)", fontsize=17)

Restricting the analysis to a single year, we see far fewer outliers.

**Violin plots**: combine the information present in a boxplot with a density chart. Despite being extremely rich in information, they are not widespread in practice.

In [0]:
plt.figure(figsize=(12,6))
sns.violinplot(x="continent", y="lifeExp", data=data_2007)
plt.xlabel("Continent", fontsize=15)
plt.ylabel("Life expectancy",fontsize=15)
plt.title("Life expectancy per continent (2007)", fontsize=17)