# Exploratory data analysis (EDA)

The tasks that are assigned to analysts are quite diverse. However, it all starts with data.

In this course, we will not touch on the business side of data analysis, but we need to understand that neither data nor the tasks associated with it come "out of thin air". In his book on operational analytics, [Bill Franks](https://play.google.com/store/books/details/%D0%91_%D0%A4%D1%80%D1%8D%D0%BD%D0%BA%D1%81_%D0%A0%D0%B5%D0%B2%D0%BE%D0%BB%D1%8E%D1%86%D0%B8%D1%8F_%D0%B2_%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA%D0%B5_%D0%9A%D0%B0%D0%BA_%D0%B2_%D1%8D%D0%BF%D0%BE%D1%85%D1%83_Big_Dat?id=yPvkDQAAQBAJ) stresses that ill-conceived investments in collecting and storing data on the principle of "what if it comes in handy later" often do not pay off. Only after a specific goal has been set can the process of collecting (or possibly buying) and analyzing data begin.

Unfortunately, in practice, raw data is usually unsuitable and/or unusable for analysis. The process of preparing and cleaning the data (data preparation, preprocessing, data cleaning) can be **very time-consuming** and take more time than actually building and validating models based on data. Let's highlight some of the components of this process:

- data specification and understanding
- data editing, error correction
- working with missing values
- normalization (standardization)
- feature extraction and selection

As a result, we obtain data in a convenient format, usually a table. A table (or dataframe) has an "objects-features" structure: rows correspond to individual entities (objects, examples, instances), and columns correspond to attributes of these entities (features). We will see an example very soon.
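
As a small illustration of the "objects-features" layout and of two of the preparation steps listed above, here is a minimal sketch on a made-up table. The column names and values are invented for illustration and do not come from the course dataset; filling gaps with the median and z-score standardization are only one possible choice among many.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: each row is an object (a patient), each column is a feature
toy = pd.DataFrame({
    "height": [170.0, 165.0, np.nan, 180.0],
    "weight": [70.0, 60.0, 80.0, 90.0],
})

# Working with missing values: fill the gap with the column median
toy["height"] = toy["height"].fillna(toy["height"].median())

# Standardization: subtract the column mean and divide by the column standard deviation
standardized = (toy - toy.mean()) / toy.std()

print(toy.shape)  # (number of objects, number of features)
print(standardized.round(2))
```

After standardization every column has zero mean and unit standard deviation, which puts features measured in different units on a comparable scale.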

## Python library ecosystem. NumPy

Python is a high-level, general-purpose programming language. Today it is the most popular programming language in Data Science and Machine Learning. However, "pure" Python has a number of disadvantages, mainly related to code execution time. Traditional data structures like lists and tuples, as well as `for` and `while` loops, are "slow", and in the case of big data analysis this becomes a problem.

Probably the clearest explanation of why standard data types and loops are "slow" can be found in the book by Jake VanderPlas: [https://jakevdp.github.io/PythonDataScienceHandbook/](https://jakevdp.github.io/PythonDataScienceHandbook/) (chapter 2, section "Understanding Data Types in Python").

The [NumPy](https://numpy.org/) library is designed to work with multidimensional arrays (ndarrays) so that operations on large data run **significantly faster** (sometimes hundreds or even thousands of times) than in "pure" Python. The library provides a large number of fast, high-level operations on one-, two- and multidimensional arrays (tensors), as well as a number of vector and matrix algebra functions. Many higher-level libraries in the Python ecosystem are built on NumPy arrays or interoperate closely with them (Pandas, Matplotlib, Scikit-Learn, deep learning libraries such as TensorFlow and PyTorch, and many others), which makes understanding NumPy arrays and the capabilities of this library a "must-have" skill for a data analyst.
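
To make the contrast concrete, here is a minimal sketch that computes the same sum of squares twice: once with an explicit loop over a Python list and once with a single vectorized NumPy expression. The array size is arbitrary; the actual speed-up depends on the machine and can be measured with `%timeit` in a notebook.

```python
import numpy as np

values = list(range(1_000_000))

# "Pure" Python: an explicit loop over a list
total_loop = 0
for v in values:
    total_loop += v * v

# NumPy: the same computation as one vectorized expression over an ndarray
arr = np.arange(1_000_000, dtype=np.int64)
total_vec = int((arr * arr).sum())

# Both approaches give the same number; only the execution time differs
print(total_loop == total_vec)
```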

## Pandas library

[Pandas](https://pandas.pydata.org/) is a Python library for loading, preprocessing, transforming and combining data, as well as for exploratory data analysis. Exploratory analysis directly precedes the construction of predictive machine learning models and is designed to help the researcher better understand the features of the dataset and the relationships (correlations) between them, and to draw the first simple conclusions from the data. However, "simple" does not mean "bad". These (at first glance) primitive conclusions provide baselines for subsequent, more complex models, and the patterns found at the exploratory stage may turn out to be enough to achieve the desired goal without diving into complex machine learning models.

## Data visualization. Matplotlib and Seaborn libraries

An important component of exploratory data analysis is data visualization. High-quality graphs and charts help you see more than boring and monotonous tables. The Pandas library has built-in visualization tools based on Matplotlib. The [Matplotlib](https://matplotlib.org/) library itself provides many low-level graphical tools, so the researcher can control literally everything, from the color of points to the fonts on the coordinate axes. The [Seaborn](https://seaborn.pydata.org/) library offers higher-level functions and is intended to make life easier for Matplotlib users by automating many routine things. Typically, the built-in Pandas plotting, Matplotlib and Seaborn are used together, which we will demonstrate later.

## BI tools for data analysis

Intense competition in global industries has made businesses seek out ways of managing their business processes for sustained growth. Organizations are seeking business intelligence (BI) solutions to achieve competitiveness through advanced data analysis (especially in the era of Big Data).

These tools support collecting, analyzing and monitoring data and predicting future business scenarios by giving a clear view of all the data a company manages. Identifying trends, enabling self-service analytics, using powerful visualizations, and offering professional **BI dashboards** are becoming standard in business operations and strategic development and, ultimately, indispensable for increasing profit. Moreover, the self-service nature of these solutions gives users at all levels access to every feature just mentioned, without the need for technical skills or specialized training. This makes them well suited to democratizing the data analysis process and boosting business performance.

Here are some well-known BI tools: Microsoft Power BI, Tableau, SAS Business Intelligence, Zoho Analytics etc. 



In [None]:
# Import the libraries used throughout the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # apply Seaborn's default plot styling

## Data loading

Pre-prepared and processed data is usually tabular and stored as CSV files (or as TSV, XLS, XLSX, etc.). For CSV and TSV files, use the [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function; Excel files are read with `read_excel`. Data can also be loaded directly from relational databases, and Pandas has the [read_sql](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html) function for this purpose. In other cases, [read_json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) and [other readers](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) can help.

In this course, we are going to work not only with "traditional" tabular data, but also with text and images.

The [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function has many configurable options. The most significant are: the file name (or URL); the cell separator (`sep`, a comma by default); the header row (`header`, the number of the row that holds feature names; by default they are read from the first line of the file); and the column with row indices or identifiers (`index_col`, a column number; by default no column is used). For other options, please refer to the documentation.
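
The options described above can be tried without any file on disk. The sketch below reads a small in-memory text buffer; its contents are made up for illustration and only mimic a semicolon-separated file with a header line and an identifier column.

```python
import io
import pandas as pd

# Hypothetical file contents: semicolon-separated, the first line is the header,
# the first column holds row identifiers
raw = io.StringIO(
    "id;age;height\n"
    "0;50;168\n"
    "1;55;156\n"
)

# sep: cell separator, header: row with column names, index_col: column with row labels
sample = pd.read_csv(raw, sep=";", header=0, index_col=0)
print(sample)
```

With `index_col=0` the `id` column becomes the row index instead of an ordinary feature.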

In [None]:
df = pd.read_csv('../input/cardio_train.csv', sep=';')

## First look at the data

The [head(n)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) method is designed to view the first **n** rows of the table (**n=5** by default). Similarly, the [tail(n)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html) method returns the last **n** rows.


In [None]:
df.head()

If there are too many features (columns), it could be useful to transpose the output:

In [None]:
df.head(10).T

The [info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) method allows you to display general information about the dataset. We can find out the type of each feature, as well as whether there are missing values in the data.

In [None]:
df.info()

The [describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) method allows you to collect some statistics for each numerical feature. For easier reading, the resulting table could be transposed.

In [None]:
df.describe().T

Note that some of the features are binary (**smoke**, **alco**, **active**, **cardio**), so standard descriptive statistics (mean, standard deviation, median, quartiles) do not make much sense for them. In this case, simply counting the values with [value_counts()](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html) is more useful. For example, this way we can find out how many patients with identified cardiovascular disease (CVD) are in the sample.

In [None]:
df['cardio'].value_counts()

We see that we have an approximately equal number of healthy and sick people, that is, classes 0 and 1 are balanced (we will talk about the problem of unbalanced classes later).

Finally, the **normalize** parameter returns the share of each class (as a fraction of 1) instead of the raw counts:

In [None]:
df['cardio'].value_counts(normalize=True)
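
For comparison, here is the same pair of calls on a tiny made-up binary column, where the numbers are easy to check by hand. The values are hypothetical.

```python
import pandas as pd

# Hypothetical binary feature
flags = pd.Series([0, 1, 1, 0, 1, 1, 1, 0])

counts = flags.value_counts()                # absolute counts per value
shares = flags.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print((shares * 100).round(1))  # multiply by 100 to read the shares as percentages
```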

## One-dimensional feature analysis

Let's look at the distribution of patient height values. The theory says that the height is a variable that usually has a normal distribution.

In [None]:
df['height'].hist();

The default chart turned out to be uninformative. Let's try to improve the situation by adding the **bins** parameter.

In [None]:
df['height'].hist(bins=20);

As expected, the histogram resembles that of a normal distribution. However, **outliers** (values that "stand out" from the overall picture) are hard to see on it. Therefore, it is sometimes more useful to draw a **boxplot** ("box-and-whisker diagram").

In [None]:
sns.boxplot(x=df['height']);  # pass the column as x to get a horizontal box

The width of the "box" equals the interquartile range (IQR, the difference between the third quartile $Q_3$ and the first quartile $Q_1$). The vertical line inside the box shows the median (second quartile). The "whiskers" extend to the most extreme points that still fall within the interval $[Q_1 - 1.5 \cdot IQR,\ Q_3 + 1.5 \cdot IQR]$. Finally, individual points on the graph correspond to outliers, that is, values that are not typical for this sample. As you can see, there are quite a few of them.
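
The whisker rule can also be applied numerically. The sketch below uses a short made-up list of heights with two planted suspicious values; the exact quantile convention may differ slightly between libraries, so treat the bounds as an approximation of what the plot shows.

```python
import pandas as pd

# Hypothetical heights in cm; 250 and 60 are planted as suspicious values
heights = pd.Series([150, 155, 160, 162, 165, 168, 170, 175, 180, 250, 60])

q1 = heights.quantile(0.25)
q3 = heights.quantile(0.75)
iqr = q3 - q1

# Bounds of the interval [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the interval are the ones a boxplot draws individually
outliers = heights[(heights < lower) | (heights > upper)]
print(lower, upper)
print(outliers.tolist())
```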

## Two- and more-dimensional feature analysis

For example, a researcher may be interested in the question: what is the average age of healthy and sick patients? The age attribute has an inconvenient unit of measurement --- days, so let's convert it to a number of years.

In [None]:
df['age'] = (df['age'] / 365).round()

Note that we call the Series **method** `round()` here rather than Python's built-in `round` **function**. The method is applied to the whole column at once, which greatly speeds up the calculation. The operation "divide a column by a number" works intuitively: each element is divided by this number. NumPy magic in action!
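
A minimal sketch of the difference, on a few made-up ages in days: the vectorized expression handles the whole column in one step, while the built-in `round` has to be called once per element.

```python
import pandas as pd

# Hypothetical ages in days
age_days = pd.Series([18393, 20228, 18857, 17623])

# Vectorized: the division and the rounding are applied to the whole column at once
age_years = (age_days / 365).round()

# Element-by-element equivalent with the built-in round(), for comparison only
age_years_loop = [float(round(d / 365)) for d in age_days]

print(age_years.tolist())
print(age_years_loop)
```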

### GROUP BY

Here we meet one very useful operation: grouping. The [groupby()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method works similarly to the GROUP BY operator in SQL and allows you to group data by one or more attributes and then calculate aggregates in each group. The syntax is quite simple, concise and intuitive.

In [None]:
df.groupby('cardio')['age'].mean()

Calculations show that the average age of people with CVD is slightly higher than that of healthy people. These calculations can also be visualized using Pandas' built-in graphics.

In [None]:
df.groupby('cardio')['age'].mean().plot(kind='bar') 
plt.ylabel('Age')
plt.show();
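
Grouping is not limited to a single aggregate. As a sketch on a made-up table (the values are hypothetical), `agg` computes several statistics per group in one call:

```python
import pandas as pd

# Hypothetical toy table with a grouping column and a numeric column
toy = pd.DataFrame({
    "cardio": [0, 0, 1, 1, 1],
    "age":    [40, 50, 50, 60, 70],
})

# Several aggregates per group in one call
summary = toy.groupby("cardio")["age"].agg(["count", "mean", "median"])
print(summary)
```

The result is a small dataframe with one row per group and one column per aggregate.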

### Countplot
Now let's try to see how the number of healthy and sick patients is distributed by age groups. The [countplot](https://seaborn.pydata.org/generated/seaborn.countplot.html) of the Seaborn library will help us here.

In [None]:
plt.figure(figsize=(15, 8))
sns.countplot(y='age', hue='cardio', data=df);

An important observation --- starting from the age of 55, the number of sick patients exceeds the number of healthy ones.

### Scatter plot

A useful type of plot for investigating pairs of numeric features is a scatter plot. Consider the age and height of patients.

In [None]:
plt.scatter(df['age'], df['height']);

Here it becomes clear that the outliers in our data are simply input errors. Unless, of course, the study was conducted among Lilliputians :)

To study the joint distribution of two numerical features, the jointplot of the Seaborn library can be useful:

In [None]:
sns.jointplot(x='height', y='weight', data=df);

Errors and anomalies in the data are clearly visible in this graph as well. It can also be concluded that, without taking into account outliers, height and weight have distributions that are close to normal.

### Pivot tables

For the study of three or more features, pivot tables are useful tools. This tool is well known to advanced users of Excel and Google Sheets. Consider how to use a pivot table to answer the questions:
- is it true that people become more prone to drinking alcohol with age?
- is it true that the percentage of CVD is higher among smokers?

In [None]:
# values - features by which the values of the aggfunc function are calculated
# index - features by which grouping is performed
df.pivot_table(values=['age', 'cardio'], index=['smoke', 'alco'], aggfunc='mean')

As you can see, the answers to both questions are negative. Drinking habits do not appear to be correlated with age, and CVD rates are higher among non-smokers.
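
To see what `pivot_table` computes, here is a sketch on a six-row made-up table (all values are hypothetical). For a mean aggregate with only `index` specified, the result matches a `groupby` over the same columns.

```python
import pandas as pd

# Hypothetical toy table with the same column names as in the pivot table above
toy = pd.DataFrame({
    "smoke":  [0, 0, 0, 1, 1, 1],
    "alco":   [0, 0, 1, 0, 1, 1],
    "age":    [50, 54, 60, 48, 52, 56],
    "cardio": [0, 1, 1, 0, 1, 0],
})

pivot = toy.pivot_table(values=["age", "cardio"], index=["smoke", "alco"], aggfunc="mean")

# The same table written as a groupby, to show that the two are equivalent here
grouped = toy.groupby(["smoke", "alco"])[["age", "cardio"]].mean()

print(pivot)
```

Because **cardio** is coded as 0/1, its mean within a group is the share of patients with CVD in that group.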

To understand how drinking and smoking are related, let's look at the crosstab (contingency table):

In [None]:
pd.crosstab(df['smoke'], df['alco'])

So far, we can only say that there are significantly more non-drinking and non-smoking patients than all the rest. For reasonable conclusions about the relationship, one should turn to numerical calculations.
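
One possible numerical follow-up, sketched on made-up data: normalize the contingency table by rows to compare shares instead of raw counts, and compute the Pearson correlation of the two 0/1 columns (for binary variables this is also known as the phi coefficient). This is only an illustration of the idea, not a full statistical test.

```python
import pandas as pd

# Hypothetical toy data: two binary habits for ten people
toy = pd.DataFrame({
    "smoke": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "alco":  [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})

# Share of drinkers within each smoking group (each row sums to 1)
shares = pd.crosstab(toy["smoke"], toy["alco"], normalize="index")

# Pearson correlation of two 0/1 columns (phi coefficient)
phi = toy["smoke"].corr(toy["alco"])

print(shares)
print(round(phi, 3))
```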

## Selecting data by condition. Indexing methods in Pandas

Sometimes we need to perform calculations not on the entire training set, but on some part of it. To do this, you need to know and understand how to access cells in dataframes.

Let's start with the study of one feature "in itself". Let's take **height** as an example.

In [None]:
h = df['height']
type(h) 

Thus, we see that the table (DataFrame) is a set of named columns (Series). Columns are accessed by the key --- column name, as in Python dictionaries. Technically, you can think of a dataframe as a dictionary of columns.

But what about the rows?

In [None]:
# first_patient = df[0]

Oops! Uncommenting the line above raises an error: KeyError means there is no column named "0". That is, we cannot access a row through regular square-bracket indexing. To do this, we need an "implicit" (positional) index, `iloc` (integer location).

In [None]:
first_patient = df.iloc[0]
print(first_patient)

Again, we see that a dataframe row behaves like a dictionary (technically, it is a Series). The keys are the names of the columns, and the values are the values of the features for the given row.

To find out, for example, the age of the first patient (without storing it in a separate variable), you need to apply explicit, label-based indexing (`loc`):

In [None]:
print(df.loc[0, 'age'])
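
The difference between `loc` and `iloc` is easiest to see when the row labels are not simply 0, 1, 2, and so on. A sketch with a made-up table whose index starts at 10:

```python
import pandas as pd

# Hypothetical table whose row labels are not 0, 1, 2, ...
toy = pd.DataFrame(
    {"age": [50, 55, 52], "height": [168, 156, 165]},
    index=[10, 20, 30],
)

by_position = toy.iloc[0]     # first row, whatever its label is
by_label = toy.loc[10]        # row whose label is 10
cell = toy.loc[20, "height"]  # label-based access to a single cell
same_cell = toy.iloc[1, 1]    # position-based access to the same cell

print(by_position.equals(by_label), cell, same_cell)
```

In this table `toy.loc[0]` would raise a KeyError, because no row carries the label 0, while `toy.iloc[0]` still returns the first row.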

Let's return now to the variable **h**. Recall that we stored in it all the values from the **height** column. Height is in centimeters. Let's convert to meters.

In [None]:
h_meters = h / 100 
h_meters[:10] 

Above, in several charts, we saw that there are errors among the height values. Let's see how many patients are shorter than 125 cm. How would we solve this problem in the "classical" style?

In [None]:
%%timeit
lilliputs = 0
for value in h:
    if value < 125:
        lilliputs = lilliputs + 1

Now let's solve the same problem in NumPy style:

In [None]:
%%timeit
h[h < 125].shape[0]

So, the second method turned out to be about 5 times faster on a data set of 70,000 values (which is relatively small). As the length of the vector grows, loops can become hundreds or even thousands of times slower than vectorized NumPy operations.
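
Both styles give the same answer, which is easy to confirm on a short made-up series. The sketch also shows a slightly shorter vectorized variant: summing the boolean mask directly, since `True` counts as 1.

```python
import pandas as pd

# Hypothetical heights in cm
h_toy = pd.Series([110, 150, 168, 120, 175, 180])

# Loop version, as in the "classical" style
count_loop = 0
for value in h_toy:
    if value < 125:
        count_loop += 1

# Vectorized versions: a boolean mask, where True counts as 1 when summed
mask = h_toy < 125
count_mask = int(mask.sum())
count_filter = h_toy[mask].shape[0]

print(count_loop, count_mask, count_filter)
```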

The selection from an array of values can be performed by a conditional index. It works similarly for selecting rows in a dataframe. Let's calculate the average age of smokers.

In [None]:
df[df['smoke'] == 1]['age'].mean()

Conditions can also be combined:

In [None]:
df[(df['smoke'] == 1) & (df['cardio'] == 1)]['age'].mean()
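
A sketch of how such conditions are combined, on a made-up five-row table: `&` means "and", `|` means "or", `~` means "not", and each comparison needs its own parentheses because these operators bind more tightly than `==`.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "smoke":  [1, 1, 0, 0, 1],
    "cardio": [1, 0, 1, 0, 1],
    "age":    [60, 45, 58, 40, 64],
})

both = (toy["smoke"] == 1) & (toy["cardio"] == 1)    # smokers with CVD
either = (toy["smoke"] == 1) | (toy["cardio"] == 1)  # smokers or patients with CVD
neither = ~either                                    # everyone else

# .loc selects the rows by the mask and the column by name in a single step
mean_both = toy.loc[both, "age"].mean()
mean_neither = toy.loc[neither, "age"].mean()

print(mean_both, mean_neither)
```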

## Dataframe filtering. Deleting rows and columns

The [drop()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) method is used to remove rows and columns in a dataframe. Consider deleting by keys and by condition.

Let's delete the target feature, **cardio**.

In [None]:
dummy_df = df.drop('cardio', axis=1)
dummy_df.head()

Now, let's delete the first 100 patients:

In [None]:
dummy_df = df.drop(np.arange(100), axis=0)
dummy_df.head()

And finally, we will remove all the patients with a height below 125 cm, as well as above 200 cm.

In [None]:
dummy_df = df.drop(df[(df['height'] < 125) | (df['height'] > 200)].index)
dummy_df.shape[0] / df.shape[0]

As you can see, the percentage of outliers is small --- the remaining sample is 99.9% of the original.
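
Dropping rows by condition can equally be written as keeping the rows that satisfy the opposite condition. A sketch on made-up heights, using `between` (inclusive on both ends by default):

```python
import pandas as pd

# Hypothetical heights in cm
toy = pd.DataFrame({"height": [110, 150, 168, 250, 175]})

# Variant 1: drop the rows whose labels match the "bad" condition
bad = (toy["height"] < 125) | (toy["height"] > 200)
dropped = toy.drop(toy[bad].index)

# Variant 2: keep the rows that satisfy the opposite condition
kept = toy[toy["height"].between(125, 200)]

print(dropped.equals(kept), len(kept) / len(toy))
```

Both variants return a new dataframe and leave the original one unchanged.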

## Adding new features

In [None]:
df['height_m'] = df['height'] / 100  # height converted from centimeters to meters
df.head()

## Encoding feature values

Our dataset contains only numerical values, but real datasets often contain **categorical** features. In that case, one of the encoding procedures must be applied at the preprocessing stage. The simplest type of encoding is the replacement of some values with others (**label encoding**). Since our data is already numeric, we will (solely to demonstrate how the method works) apply the inverse operation and replace numeric codes with text labels. For example, we recode the feature "cholesterol level" according to the principle:

- 1: "low"
- 2: "normal"
- 3: "high"

In [None]:
new_values = {1:'low', 2:'normal', 3:'high'}
df['dummy_cholesterol'] = df['cholesterol'].map(new_values)
df.head()
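
In the usual direction (text labels to numbers) the same `map` call works with the dictionary reversed. The sketch below uses a made-up series; the code values are chosen by hand, and the "unknown" entry is included to show that values missing from the dictionary turn into NaN.

```python
import pandas as pd

# Hypothetical categorical feature with one unexpected value
levels = pd.Series(["low", "normal", "high", "normal", "unknown"])

# Label encoding: text categories -> integer codes (the mapping is chosen by hand)
codes = {"low": 1, "normal": 2, "high": 3}
encoded = levels.map(codes)

# Values missing from the dictionary become NaN, so it is worth checking for them
n_unmapped = int(encoded.isna().sum())

# The inverse dictionary restores the original labels for the rows that were mapped
inverse = {v: k for k, v in codes.items()}
decoded = encoded.dropna().astype(int).map(inverse)

print(encoded.tolist(), n_unmapped)
print(decoded.tolist())
```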

Let's recode the target feature **cardio** into boolean (True/False).

In [None]:
df['cardio'] = df['cardio'].astype(bool)
df.head()

# Exercises for independent work

1. Determine the number of men and women among the subjects. Note that we do not know how the gender variable is encoded. Let's use a medical fact: men are, on average, taller than women.

2. Is it true that men are more inclined to drink alcohol than women?

3. What is the difference between the percentages of smokers among men and among women?

4. What is the difference between the mean ages of smokers and non-smokers?

5. Create a new feature, BMI (body mass index). To do this, divide the weight in kilograms by the square of the height in meters. Normal BMI values are considered to range from 18.5 to 25. Choose the correct statements:

    (a) The mean BMI is within the range of normal BMI values.

    (b) BMI is on average higher for women than for men.

    (c) Healthy people have, on average, a higher BMI than people with CVD.

    (d) For healthy non-drinking men, BMI is closer to normal than for healthy non-drinking women.

6. Remove the patients whose diastolic pressure is higher than their systolic pressure. What percentage of the total number of patients did they make up?

7. The European Society of Cardiology website presents the [SCORE](https://www.escardio.org/static_file/Escardio/Subspecialty/EACPR/Documents/score-charts.pdf) scale. It is used to calculate the risk of death from cardiovascular disease in the next 10 years.

    Consider the upper-right rectangle, which shows the subset of smoking men aged 60 to 65 (the values on the vertical axis in the figure represent the upper bound).

    We see the value 9 in the lower-left corner of the rectangle and 47 in the upper-right corner. This means that for people in this age group with systolic pressure below 120 and low cholesterol, the risk of cardiovascular disease is estimated to be about 5 times lower than for people with pressure in the interval [160, 180] and high cholesterol.

    Calculate the analogous ratio for our data.

8. Visualize the distribution of cholesterol levels for different age categories.

9. How is the BMI variable distributed? Are there any outliers?

10. How are BMI and the presence of CVD related? Come up with a suitable visualization.
