# Data analysis and presentation

## Preparation before analysis

Whenever you want to explore and extract information from a dataset, you first need to understand what kind of information can be obtained from the available data. In general, data is classified as:

**Numerical:** Also known as quantitative data, these are values that represent counts or measurements, such as age, height or weight. With this type of data it is possible to do statistical analysis and compute the mean, median, standard deviation, etc. Numerical data is further divided into two groups:


*   **Discrete:** Represented by integer numbers (ex: Age).
*   **Continuous:** Can assume any real value (ex: Weight, height).

**Categorical Data:** Also known as qualitative data, these are values with non-numerical characteristics. They can be:


*   **Ordinal:** Data that can be ordered in a meaningful way (ex.: Age range, stages of a disease, dates).
*   **Nominal:** Data defined only by names, with no specific order (ex.: Blood type, race, sex, yes/no).
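
As a minimal sketch of how the ordinal/nominal distinction can be expressed in Pandas, the example below uses the `Categorical` type. The values are hypothetical and made up only for illustration; they are not part of the tutorial's datasets.

```python
import pandas as pd

# Hypothetical values, made up only for this illustration
stages = ["II", "I", "III", "I"]
blood_types = ["A", "O", "AB", "O"]

# Ordinal: the order of the categories is meaningful, so we declare it explicitly
ordinal = pd.Series(
    pd.Categorical(stages, categories=["I", "II", "III"], ordered=True)
)

# Nominal: the categories are only names, with no order between them
nominal = pd.Series(pd.Categorical(blood_types))

print(ordinal.sort_values().tolist())  # sorted by the declared category order
print(ordinal.max())                   # the "largest" stage is well defined
print(nominal.cat.ordered)             # False: no order among blood types
```

Declaring the order lets operations such as sorting and `max()` follow the meaning of the categories instead of alphabetical order.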


![alt text](
https://cdn.shopify.com/s/files/1/1334/2321/articles/Picture1.png?v=1497575369)
Source: [Data analysis](https://legac.com.au/blogs/further-mathematics-exam-revision/further-mathematics-unit-3-data-analysis-types-of-data)


### Invalid or missing data

Whenever a dataset is collected and sent for analysis, a number of activities must be carried out before it is possible to extract any relevant and reliable information. In the previous topics, we've seen how to start exploring data with Pandas. However, after obtaining our dataframe, we need to check the integrity of the data and clean it before doing any analysis. According to [IBM Data Analytics](https://www.ibm.com/cloud/blog/ibm-data-catalog-data-scientists-productivity), 80% of the time devoted to analyzing a dataset is spent cleaning up the data.

Handling missing data is an important stage of data cleaning (here, any value that can't be used in the analysis counts as missing). We're going to use a small [data set](https://raw.githubusercontent.com/dataoptimal/posts/master/data%20cleaning%20with%20python%20and%20pandas/property%20data.csv), which is still big enough to show how to deal with missing data.

Run the cells below to import the example data.

In [0]:
import pandas as pd

In [0]:
missing_data = pd.read_csv('https://raw.githubusercontent.com/dataoptimal/posts/master/data%20cleaning%20with%20python%20and%20pandas/property%20data.csv',sep=',')
missing_data

Some invalid values are noticeable in the data frame above. Pandas detects some invalid or missing values on its own and labels them `NaN`.

`isnull()` is a Pandas method that identifies missing values in a series.

In [0]:
missing_data['NUM_BATH'].isnull()

Note that the `isnull()` method returns `True` wherever the evaluated field has a missing value.

It is not practical to apply the `isnull()` method manually to each characteristic. To count the missing values in each characteristic, apply `isnull()` to the whole data set and combine the result with the `sum()` method.


In [0]:
missing_data.isnull().sum()

Pandas is not always able to identify invalid data. In our example, there is an invalid value `'na'` in the `NUM_BEDROOMS` series and another invalid value `'--'` in the `SQ_FT` series.

In this case, we can use the `unique()` or `value_counts()` methods to see the values that exist in a series:

In [0]:
missing_data["NUM_BEDROOMS"].unique()

In [0]:
missing_data["NUM_BEDROOMS"].value_counts()

In [0]:
missing_data["SQ_FT"].value_counts()
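
Once such markers are known, one possible way to make Pandas treat them as missing is the `na_values` parameter of `read_csv()`. The sketch below uses a tiny inline CSV, made up for illustration, instead of the tutorial's file; only the column names and the two markers come from the example above.

```python
import io
import pandas as pd

# Hypothetical CSV content, made up for this illustration
csv_text = """NUM_BEDROOMS,SQ_FT
3,1000
na,--
2,850
"""

# Default behaviour: 'na' and '--' are read as ordinary text
raw = pd.read_csv(io.StringIO(csv_text))

# Telling Pandas which extra markers should be read as NaN
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=["na", "--"])

print(raw.isnull().sum())
print(cleaned.isnull().sum())
```

With the markers declared, `isnull().sum()` counts them like any other missing value, and the affected columns can be read as numbers instead of text.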

Another kind of invalid data occurs when a value differs from what is expected for a characteristic. The column `OWN_OCCUPIED` should contain only the values `Y` or `N`. However, one of the rows contains the value `12`, which has no relation to the expected values.

In this case, we can use the `isin()` and `all()` methods to check whether all the values of a series respect the **domain** of expected values for that series.

* The `isin()` method checks whether each value is in a list of options, converting the original series into a series of `True` (when the value is in the list) or `False` (otherwise) values.
* The `all()` method checks whether every value in that new series is `True`.

In [0]:
missing_data["OWN_OCCUPIED"].unique()

In [0]:
domain_condition = missing_data["OWN_OCCUPIED"].isin(["Y","N"])
domain_condition.all()

To identify which rows of the `"OWN_OCCUPIED"` series violate the condition, we can invert it using the `~` operator (read as *not*):

In [0]:
missing_data[~domain_condition]
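
After locating the values outside the domain, a possible next step is to mark them as missing so they are handled together with the other `NaN` values. This is only a sketch of one approach, using a small hypothetical series rather than the tutorial's data frame:

```python
import pandas as pd

# Hypothetical series imitating the OWN_OCCUPIED column
own_occupied = pd.Series(["Y", "N", "12", "Y", None], name="OWN_OCCUPIED")

# True where the value belongs to the expected domain
valid = own_occupied.isin(["Y", "N"])

# where() keeps the values for which the condition is True
# and replaces all the others with NaN
cleaned = own_occupied.where(valid)

print(cleaned)
print(cleaned.isnull().sum())
```

Note that `isin()` returns `False` for values that are already missing, so they simply remain missing after `where()`.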

## Starting the analysis

The data for this part of the tutorial will be loaded from a URL.

Let's have Pandas download the dataset directly, giving it only the URL where the data is located.

In [0]:
url_data = 'http://bit.ly/2cLzoxH'
data = pd.read_csv(url_data)
data.head(n=10)

In [0]:
#Take a quick look at the data structure
#data.dtypes

In [0]:
#data["continent"].value_counts()

Once the data cleaning is done, the first set of tools we can use for analysis is **descriptive statistics**.

Pandas offers the main **central** and **dispersion** measures, which we can apply to any numeric data series.

### Central Measures

![alt text](https://media.proprofs.com/images/discuss/user_images/153336/9973678110.jpg) Source:  [Measure of Central Tendency](https://www.proprofs.com/discuss/q/273982/which-of-the-following-is-not-measure-central-tendency)

**Mean**: The sum of all the values divided by the number of observations in the dataset.

In [0]:
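# numeric_only=True skips text columns such as country and continent,
# which would raise an error in recent Pandas versions
data.mean(numeric_only=True)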

**Median**: The middle value, which separates the upper half of the dataset from the lower half.

In [0]:
data["year"].median()

**Mode**: The value(s) that appear most frequently in the dataset.

In [0]:
data["year"].mode()
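
To see how the three central measures can differ, here is a small sketch with hypothetical numbers (not taken from the dataset). The single large value pulls the mean upwards, while the median and the mode stay where most of the values are:

```python
import pandas as pd

# Hypothetical values with one unusually large observation
values = pd.Series([1, 2, 2, 3, 10])

mean_value = values.mean()      # sum of the values divided by their count
median_value = values.median()  # middle value of the sorted series
mode_values = values.mode()     # a Series, since there may be several modes

print(mean_value, median_value, mode_values.tolist())
```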

### Dispersion Measures

**Variance**: Indicates how scattered the data in a series is.

It is calculated as the mean squared distance between each value of the series and the series' mean. Each distance is squared before summing, so that positive and negative distances do not cancel each other out. Because of that, the variance is expressed in squared units and its order of magnitude does not match that of the series data.

A low variance indicates that the values of the series tend to be close to the mean. A high variance indicates that the values are widely scattered.

In [0]:
data["year"].var()

**Standard deviation**: The square root of the variance. It retains the variance's properties, but has the same order of magnitude as the series data:

In [0]:
data["year"].std()
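
The calculation described above can be reproduced by hand. The sketch below uses hypothetical values and also shows a detail worth knowing: by default, `var()` and `std()` in Pandas divide by `n - 1` (sample variance, `ddof=1`) rather than by `n`.

```python
import pandas as pd

# Hypothetical values, chosen so that the arithmetic is easy to follow
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Squared distance of each value to the mean
squared_distances = (values - values.mean()) ** 2

# Dividing by n gives the population variance,
# dividing by n - 1 gives the sample variance
population_variance = squared_distances.sum() / len(values)
sample_variance = squared_distances.sum() / (len(values) - 1)

print(population_variance, values.var(ddof=0))  # same quantity
print(sample_variance, values.var())            # Pandas default (ddof=1)
print(values.std(ddof=0))                       # square root of the population variance
```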

**Quantiles**: They partition the ordered values of a series. A 25% quantile indicates that 25% of the series values are less than that quantile. By convention, ***quartiles*** are the 25%, 50% and 75% quantiles, also known as the first, second and third quartiles:


In [0]:
data["year"].quantile(0.25)

In [0]:
first_quartile = data.query(f"year < {data['year'].quantile(0.25)}")
first_quartile.shape

In [0]:
data.shape

In [0]:
# Testing for other quartiles

#data["year"].quantile(0.5)

In [0]:
#second_quartile = data.query(f"year < {data['year'].quantile(0.5)}")
#second_quartile.shape
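
The same check can be sketched on a small hypothetical series, which makes the proportions easy to verify. The second part illustrates a caveat: when a series has many repeated values, the share of values strictly below a quantile can be smaller than the nominal percentage.

```python
import pandas as pd

# Hypothetical series: the integers from 1 to 100
values = pd.Series(range(1, 101))

# quantile() accepts a list and returns one value per requested quantile
quartiles = values.quantile([0.25, 0.5, 0.75])
q1 = quartiles.loc[0.25]

# Share of the values that lie strictly below the first quartile
share_below_q1 = (values < q1).mean()
print(quartiles.tolist(), share_below_q1)

# With repeated values, the first quartile may coincide with the minimum,
# so no value is strictly below it
repeated = pd.Series([1, 1, 1, 1, 2, 3, 4, 5])
q1_repeated = repeated.quantile(0.25)
share_below_repeated = (repeated < q1_repeated).mean()
print(q1_repeated, share_below_repeated)
```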

### Other methods of descriptive statistics

* `describe()`: Brings together many descriptive measures of the data, including the results of the `count()`, `min()` and `max()` methods:

In [0]:
data["year"].describe()

In [0]:
data.describe()

* `nunique()`: Returns the number of distinct values.

In [0]:
data.nunique()

In [0]:
data["year"].nunique()

* `sort_values()`: Sorts the values of a `DataFrame` or `Series` in ascending or descending order. When using the `sort_values()` method of a `DataFrame`, we can specify multiple columns to sort by. In this case, ties in the first column are broken by the second column, and so on.

In [0]:
data["year"].sort_values().head()

In [0]:
data.sort_values(by=['year','country'],ascending=False).head()

In [0]:
#data.sort_values(by=['year','lifeExp'],ascending=True).head()
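
When sorting by several columns, `ascending` can also receive a list with one direction per column. A minimal sketch with a small hypothetical data frame (the rows are made up, only the column names follow the dataset):

```python
import pandas as pd

# Hypothetical rows, made up for this illustration
frame = pd.DataFrame({
    "year": [2007, 1952, 2007, 1952],
    "country": ["Brazil", "Chile", "Angola", "Brazil"],
})

# Most recent years first; within the same year, countries in alphabetical order
ordered = frame.sort_values(by=["year", "country"], ascending=[False, True])
print(ordered)
```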

## Data Presentation

The analysis of the central and dispersion measures of a `DataFrame` is usually deepened by visualizing the data series.

To start, let's load the necessary libraries:
- `matplotlib` is a library that serves exclusively to create graphics;
- `seaborn` is a library designed to create statistical graphs in Python. 
It is built on top of Matplotlib and is integrated with Pandas data structures.

By convention, we only load the `pyplot` module from the `matplotlib` library and call it `plt`.

In the case of `seaborn`, we load the entire library, call it `sns` and use its `set()` function to apply its default settings.

In [0]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

### Histograms

With the commands offered by Pandas it is easy to construct a histogram. However, it is necessary to understand exactly what is being built.

In the excerpt below, we take the `lifeExp` column of the `data` set, which shows the life expectancy per year.

Calling `hist(bins=100)` produces a histogram with 100 different ranges (bins) of values.

In [0]:
data['lifeExp'].hist(bins=100)

Below we can see the (extreme) effect of building a histogram with few ranges of values (only two, in this case).



In [0]:
data['lifeExp'].hist(bins=2)

The case below is exactly the reverse of what was shown above: many ranges of values (1000 in the graph below) make understanding very difficult.

In [0]:
data['lifeExp'].hist(bins=1000)
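
To make explicit what a histogram counts, the sketch below computes the ranges and their counts without drawing anything, using NumPy's `histogram()` function on a few hypothetical values. It assumes the default behaviour of equal-width ranges between the minimum and maximum values.

```python
import numpy as np

# Hypothetical values: the integers from 1 to 10
values = np.arange(1, 11)

# Two wide ranges: each one collects half of the values
counts_2, edges_2 = np.histogram(values, bins=2)
print(counts_2.tolist(), edges_2.tolist())

# Five narrower ranges: each one collects two values
counts_5, edges_5 = np.histogram(values, bins=5)
print(counts_5.tolist(), edges_5.tolist())
```

A histogram with `bins=k` always has `k + 1` edges, and the counts always add up to the number of observations; only the level of detail changes.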

In [0]:
#Plotting the feature 'year' shows that not all information can be analyzed using a histogram
#data["year"].hist(bins=12)

In [0]:
#data["year"].value_counts()

The standard Pandas histogram is basic and only serves for a quick look at the distribution of the data, but it does not tell the whole story.

In addition to there being no names on the X and Y axes, there is a region of the X axis being presented even if there is no data in it.

We can solve this by configuring the histogram with the following options:
  - `xlabelsize` and `ylabelsize` are `hist()` parameters that set the font size of the tick labels on each axis;
  - `xlabel` and `ylabel` are `pyplot` functions that set the title of each axis and the size of that text;
  - `xlim` is also a `pyplot` function and sets the lower and upper limits of the horizontal axis.

Below we can see how to customize the information that appears in the histogram.

In [0]:
data['lifeExp'].hist(bins=100, grid=False, xlabelsize=12, ylabelsize=12)
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.title("Life expectancy distribution", fontsize=17)
plt.xlim([22.0,90.0])

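Although it is convenient to call the `hist()` method directly on a series, the `seaborn` `distplot()` function is much more powerful. Note that `distplot()` has been deprecated since seaborn 0.11 in favor of `histplot()` and `displot()`, so recent versions show a warning when it is used.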

In addition to presenting a histogram of the data, `distplot()` estimates a **probability distribution** of the data:

In [0]:
sns.distplot(data["lifeExp"])
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.title("Life expectancy distribution", fontsize=17)

In [0]:
#data["lifeExp"].describe()

In [0]:
#print("Median: ", data["lifeExp"].median())
#print("Mode: ", data["lifeExp"].mode())
#print("Variance: ", data["lifeExp"].var())

The probability distribution estimated in the graph above is an important source of information about the data.

We can compare it with a **normal distribution** by passing the `norm` distribution object from `scipy.stats` to the `fit` parameter:

In [0]:
from scipy.stats import norm

In [0]:
sns.distplot(data["lifeExp"], fit=norm)
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.title("Life expectancy distribution", fontsize=17)

In [0]:
#Do the same analysis for the feature "gdpPercap"

#data['gdpPercap'].hist(bins=50)
#plt.xlabel("GDP per capita", fontsize=15)
#plt.ylabel("Frequency",fontsize=15)
#plt.title("GDP per capita distribution", fontsize=17)

In [0]:
#sns.distplot(data["gdpPercap"], fit=norm)

In this case, we see that the actual distribution of the data differs greatly from the normal distribution.

In fact, it looks more like a **bimodal distribution**, which usually occurs when the data mixes subsets that are each roughly normally distributed around different means.

The following code cell plots life expectancy in Europe and in Africa on the same axes, showing where the bimodal distribution of the graph above comes from:

In [0]:
africa_data = data.query("continent == 'Africa'")
europe_data = data.query("continent == 'Europe'")

sns.distplot(europe_data["lifeExp"], fit=norm)
sns.distplot(africa_data["lifeExp"], fit=norm)
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.title("Life expectancy distribution in the African and European continents", fontsize=17)

In addition to being interesting from a statistical point of view, the graph above is socially striking and worrying, given such a difference between the distributions.

### Boxplots and violin plots

Other types of graphs useful for analyzing distributions are produced by the `boxplot()` and `violinplot()` functions of `seaborn`.

**Boxplot**: shows the quartiles of a series, represented by a box - the ends are the first and third quartiles, while the line inside the box is the second quartile (the median).

This type of graph is also known as a box-and-whisker plot, because the smallest and largest values that are not outliers are represented by the "whiskers" of the box.

A particularity of this graph is that the limits of the whiskers are calculated from the distance between the first and the third quartiles (the interquartile range, IQR): by default, each whisker extends to the most extreme value that lies within 1.5 times the IQR from the box. Values of the series that fall beyond these limits are considered outliers and appear in the boxplot as dots.
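
The outlier rule can be reproduced numerically. The sketch below assumes the usual default factor of 1.5 times the interquartile range and uses hypothetical values with one clearly extreme observation:

```python
import pandas as pd

# Hypothetical values with one extreme observation
values = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 19, 40])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # distance between the first and third quartiles

# Limits beyond which a value is drawn as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = values[(values < lower_limit) | (values > upper_limit)]

# The whiskers end at the most extreme values still inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
print(lower_limit, upper_limit)
print(outliers.tolist(), inside.min(), inside.max())
```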

In [0]:
sns.boxplot(x="lifeExp", y="continent", data=data.sort_values("continent"))
plt.xlabel("Life expectancy distribution", fontsize=15)
plt.ylabel("Continent",fontsize=15)
plt.title("Life expectancy per continent", fontsize=17)

As we can see, Africa is the continent with the lowest life expectancy in general, while Asia is the continent where this data is most dispersed.

In the chart above, you can see that there are many outliers.

It is interesting to filter the data and analyze the life expectancy for a single year, for example.

The code below produces a boxplot of life expectancy for the year 2007:

In [0]:
data_2007 = data.query("year == 2007")
sns.boxplot(x="lifeExp", y="continent", data=data_2007.sort_values("continent"))
plt.xlabel("Life expectancy", fontsize=15)
plt.ylabel("Continent",fontsize=15)
plt.title("Life expectancy per continent (2007)", fontsize=17)

Restricting the analysis to a single year, we see far fewer outliers.

**Violin plot**: combines the information present in a boxplot with that of a density chart. Despite being extremely rich in information, violin plots are not widely used in practice.

In [0]:
plt.figure(figsize=(12,6))
sns.violinplot(x="continent", y="lifeExp", data=data_2007)
plt.xlabel("Continent", fontsize=15)
plt.ylabel("Life expectancy",fontsize=15)
plt.title("Life expectancy per continent (2007)", fontsize=17)

### Other plot types

* **Scatterplots**: A common way to visualize a bivariate distribution is a scatterplot, where each observation is shown as a point at its x and y values.
    * `jointplot()`: draws a scatterplot by default, together with the distribution of each variable on the margins.
    * `scatterplot()`: draws only the scatterplot, which is analogous to a rug plot in two dimensions.


In [0]:
sns.jointplot(x="lifeExp", y="gdpPercap", data=data);

In [0]:
sns.scatterplot(x="lifeExp", y="gdpPercap", data=data)

* **Categorical scatterplots**

In [0]:
sns.catplot(x="continent", y="gdpPercap", data=data);

As the size of the dataset grows, categorical scatter plots become limited in the information they can provide about the distribution of values within each category. When this happens, boxplots and violin plots are better ways of summarizing the distributional information in a way that makes comparisons across the category levels easy.
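
The numbers behind such per-category summaries can also be inspected directly. As a sketch, assuming the `groupby()` method (not covered above) and a small hypothetical data frame, `describe()` returns the quartiles of each category, which are the values a boxplot would draw:

```python
import pandas as pd

# Hypothetical values, made up for this illustration
frame = pd.DataFrame({
    "continent": ["Africa"] * 3 + ["Europe"] * 4,
    "gdpPercap": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0, 40.0],
})

# One row per category, with count, mean, std, min, quartiles and max
summary = frame.groupby("continent")["gdpPercap"].describe()
print(summary)
```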

* Conclusions:

Understanding our data and how it is grouped is the first step in data science pipelines;

It is truly important to pay attention to our data and to how it is organised and visualised in order to get **significant results** from our analyses;

To carry out these analyses, it is necessary to apply measurements, distributions, relations and transformations to our data.