# Exploratory Data Analysis with Pandas

In [9]:
import pandas as pd
data = pd.read_csv('temporal.csv')
# View first 10 data rows
data.head(10)

Unnamed: 0,Mes,data science,machine learning,deep learning,categorical
0,2004-01-01,12,18,4,1
1,2004-02-01,12,21,2,1
2,2004-03-01,9,21,2,1
3,2004-04-01,10,16,4,1
4,2004-05-01,7,14,3,1
5,2004-06-01,9,17,3,1
6,2004-07-01,9,16,3,1
7,2004-08-01,7,14,3,1
8,2004-09-01,10,17,4,1
9,2004-10-01,8,17,4,1


In [4]:
# View summary statistics: count, mean, standard deviation, minimum, quartiles and maximum
data.describe()

Unnamed: 0,data science,machine learning,deep learning,categorical
count,194.0,194.0,194.0,194.0
mean,20.953608,27.396907,24.231959,0.257732
std,23.951006,28.09149,34.476887,0.438517
min,4.0,7.0,1.0,0.0
25%,6.0,9.0,2.0,0.0
50%,8.0,13.0,3.0,0.0
75%,26.75,31.5,34.0,1.0
max,100.0,100.0,100.0,1.0


In [5]:
# View the data type and non-null count of each column
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 0 to 193
Data columns (total 5 columns):
Mes                 194 non-null object
data science        194 non-null int64
machine learning    194 non-null int64
deep learning       194 non-null int64
categorical         194 non-null int64
dtypes: int64(4), object(1)
memory usage: 7.7+ KB


By default, pandas limits the number of rows and columns it displays. The options below raise those limits so that the whole dataset can be displayed.

In [6]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
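The `pd.set_option` calls above change the display limits for the rest of the session. As a minimal sketch of an alternative, `pd.option_context` raises the limits only inside a `with` block and restores the previous values afterwards. The variable names here are illustrative.

```python
import pandas as pd

# Remember the current limit so we can confirm it is restored afterwards.
default_rows = pd.get_option("display.max_rows")

# Options are passed as alternating name/value pairs and apply only inside the block.
with pd.option_context("display.max_rows", 500, "display.max_columns", 500):
    inside_rows = pd.get_option("display.max_rows")

# Outside the block the earlier setting is back in effect.
after_rows = pd.get_option("display.max_rows")
print(inside_rows, after_rows == default_rows)
```

This is useful when only one large table needs to be shown in full and the rest of the notebook should keep compact output.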

In [7]:
format_dict = {'data science':'${0:,.2f}', 'Mes':'{:%m-%Y}', 'machine learning':'{:.2%}'}
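The strings in `format_dict` use Python's standard format-specification syntax, and the styler applies each one to the values of its column. The sketch below applies the same three specs with plain `str.format`. The sample values are illustrative and are not taken from the dataset.

```python
from datetime import datetime

# The three format specs from format_dict, applied to illustrative values.
currency = '${0:,.2f}'.format(1234.5)                   # thousands separator, two decimals
month_year = '{:%m-%Y}'.format(datetime(2004, 1, 1))    # strftime-style month and year
percent = '{:.2%}'.format(0.18)                         # multiplies by 100, appends %

print(currency)    # $1,234.50
print(month_year)  # 01-2004
print(percent)     # 18.00%
```

Because the `%` spec multiplies by 100, it suits columns that hold fractions; a whole number such as 18 would be rendered as `1800.00%`.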

In [8]:
data['Mes'] = pd.to_datetime(data['Mes'])

In [11]:
data.head(10).style.format(format_dict)

Unnamed: 0,Mes,data science,machine learning,deep learning,categorical
0,01-2004,12,18,4,1
1,02-2004,12,21,2,1
2,03-2004,9,21,2,1
3,04-2004,10,16,4,1
4,05-2004,7,14,3,1
5,06-2004,9,17,3,1
6,07-2004,9,16,3,1
7,08-2004,7,14,3,1
8,09-2004,10,17,4,1
9,10-2004,8,17,4,1


In [12]:
# Highlighting max and min values with colours
format_dict = {'Mes':'{:%m-%Y}'}
data.head(10).style.format(format_dict).highlight_max(color='darkgreen').highlight_min(color='#ff0000')

Unnamed: 0,Mes,data science,machine learning,deep learning,categorical
0,01-2004,12,18,4,1
1,02-2004,12,21,2,1
2,03-2004,9,21,2,1
3,04-2004,10,16,4,1
4,05-2004,7,14,3,1
5,06-2004,9,17,3,1
6,07-2004,9,16,3,1
7,08-2004,7,14,3,1
8,09-2004,10,17,4,1
9,10-2004,8,17,4,1


In [14]:
data.head(10).style.format(format_dict).background_gradient(subset=['data science', 'machine learning'], cmap='BuGn')

Unnamed: 0,Mes,data science,machine learning,deep learning,categorical
0,01-2004,12,18,4,1
1,02-2004,12,21,2,1
2,03-2004,9,21,2,1
3,04-2004,10,16,4,1
4,05-2004,7,14,3,1
5,06-2004,9,17,3,1
6,07-2004,9,16,3,1
7,08-2004,7,14,3,1
8,09-2004,10,17,4,1
9,10-2004,8,17,4,1


In [15]:
data.head(10).style.format(format_dict).bar(color='red', subset=['data science', 'deep learning'])

Unnamed: 0,Mes,data science,machine learning,deep learning,categorical
0,01-2004,12,18,4,1
1,02-2004,12,21,2,1
2,03-2004,9,21,2,1
3,04-2004,10,16,4,1
4,05-2004,7,14,3,1
5,06-2004,9,17,3,1
6,07-2004,9,16,3,1
7,08-2004,7,14,3,1
8,09-2004,10,17,4,1
9,10-2004,8,17,4,1


In [16]:
data.head(10).style.format(format_dict).background_gradient(subset=['data science', 'machine learning'], cmap='BuGn').highlight_max(color='yellow')

Unnamed: 0,Mes,data science,machine learning,deep learning,categorical
0,01-2004,12,18,4,1
1,02-2004,12,21,2,1
2,03-2004,9,21,2,1
3,04-2004,10,16,4,1
4,05-2004,7,14,3,1
5,06-2004,9,17,3,1
6,07-2004,9,16,3,1
7,08-2004,7,14,3,1
8,09-2004,10,17,4,1
9,10-2004,8,17,4,1


Pandas profiling is a library that generates interactive reports from a DataFrame. The report shows how the data is distributed, the type of each column, and possible problems in the data. It is easy to use: three lines of code generate a report that can be sent to anyone, including readers who do not program.

In [7]:
pip install --user pandas-profiling --no-warn-script-location

Collecting pandas-profiling
  Using cached pandas_profiling-2.11.0-py2.py3-none-any.whl (243 kB)
Collecting visions[type_image_path]==0.6.0
  Using cached visions-0.6.0-py3-none-any.whl (75 kB)
Collecting tqdm>=4.48.2
  Using cached tqdm-4.59.0-py2.py3-none-any.whl (74 kB)
Collecting htmlmin>=0.1.12
  Using cached htmlmin-0.1.12-py3-none-any.whl
Collecting seaborn>=0.10.1
  Using cached seaborn-0.11.1-py3-none-any.whl (285 kB)
Collecting confuse>=1.0.0
  Using cached confuse-1.4.0-py2.py3-none-any.whl (21 kB)
Collecting missingno>=0.4.2
  Using cached missingno-0.4.2-py3-none-any.whl (9.7 kB)
Collecting phik>=0.10.0
  Using cached phik-0.11.2-py3-none-any.whl
Collecting requests>=2.24.0
  Using cached requests-2.25.1-py2.py3-none-any.whl (61 kB)
Collecting ipywidgets>=7.5.1
  Using cached ipywidgets-7.6.3-py2.py3-none-any.whl (121 kB)
Collecting matplotlib>=3.2.0
  Using cached matplotlib-3.3.4-cp37-cp37m-win_amd64.whl (8.5 MB)
Collecting networkx>=2.4
  Using cached networkx-2.5-py3-n

In [10]:
from pandas_profiling import ProfileReport
prof = ProfileReport(data)
prof.to_file(output_file='data_report.html')

Summarize dataset:   0%|          | 0/18 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]

# See the generated report
Now you can see the generated [data-report](https://data-report.netlify.app/).

# Pandas profiling disadvantage
The main disadvantage of pandas profiling is its performance on large datasets: as the data grows, the time needed to generate the report increases considerably.

# Solution

One obvious way to tackle this disadvantage is to generate the report from only a part of the dataset. It is important to make sure that the selected data is representative of the whole dataset.
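One way to draw such a subset is random sampling. The sketch below is only an illustration: the `large` DataFrame, its column names and the 10% fraction are hypothetical stand-ins for a real dataset. It shows a simple random sample and a stratified sample that keeps the proportion of each group.

```python
import pandas as pd

# Hypothetical stand-in for a large dataset with four equally sized groups.
large = pd.DataFrame({
    "value": range(10_000),
    "group": [i % 4 for i in range(10_000)],
})

# Simple random sample of 10% of the rows; random_state makes it reproducible.
sample = large.sample(frac=0.10, random_state=42)

# Stratified sample: 10% of the rows from each group, so group proportions are kept.
stratified = large.groupby("group").sample(frac=0.10, random_state=42)

# Compare a summary statistic as a rough check that the sample resembles the full data.
print(large["value"].mean(), sample["value"].mean())
print(stratified["group"].value_counts())
```

The sampled DataFrame can then be passed to `ProfileReport` in place of the full data.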

Another alternative is the minimal mode introduced in version 2.4 of pandas profiling. In minimal mode, a simplified report is generated with less information than the full one, but it can be produced relatively quickly for a large dataset.
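Minimal mode is requested with the `minimal=True` keyword of `ProfileReport`. The sketch below wraps this in a helper; the function name and the output filename are hypothetical, and the import is placed inside the function so the profiling library is loaded only when a report is requested. Newer releases of the library are published as `ydata-profiling`, in which case the import would be `from ydata_profiling import ProfileReport`.

```python
import pandas as pd


def write_minimal_report(df: pd.DataFrame, path: str = "data_report_minimal.html") -> None:
    """Generate a simplified profiling report and save it as HTML (hypothetical helper)."""
    # Imported here so the heavy dependency is loaded only when needed.
    from pandas_profiling import ProfileReport

    # minimal=True switches off the most expensive computations.
    report = ProfileReport(df, minimal=True)
    report.to_file(output_file=path)
```

With the notebook's DataFrame, the call would be `write_minimal_report(data)`.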

# Liked this project?

Star it on [GitHub](https://github.com/parthshingari28/Explorartory-Data-Analysis)!

# Reach out

[@parthshingari28](https://github.com/parthshingari28)