# Tutorial 2: Data Exploration in Python


Data exploration is one of the key steps in data analysis. It involves getting a preliminary understanding of the data, such as its structure, the statistical distribution of its features, and the interrelationships among features. There are two main motivations for data exploration:

1. It helps in selecting an appropriate data preprocessing technique for the data set.
2. It helps in selecting a suitable machine learning technique.

The Python library Pandas is a powerful tool for understanding data. It provides many data manipulation and exploration methods that can be used for basic data analysis. Combined with visualization tools, Pandas can give a deep understanding of the data. Data exploration in this tutorial uses the following two main Python libraries:

$\textbf{1. Pandas}$  

$\textbf{2. Matplotlib/Seaborn}$ 

Data exploration is divided into the following 3 main steps:

$\textbf{1. Basic data details }$ 

$\textbf{2. Summary Statistics}$

$\textbf{3. Data Visualization}$




## 1.1 Basic Data Details

Using Pandas, data sets in different formats can be loaded into the Python environment for analysis. Pandas reads a tabular data set into a DataFrame object. The three common methods used to read a data set are as follows:

1. read_csv() : uses a comma as the default delimiter

2. read_table() : uses '\t' (tab) as the default delimiter

3. read_excel() : reads an Excel file
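The difference between the first two readers can be illustrated with a minimal sketch. It reads in-memory text through `io.StringIO` instead of files on disk, and the column names and values are made up for illustration only.

```python
import io
import pandas as pd

# Hypothetical comma-separated and tab-separated text standing in for files on disk.
csv_text = "price,area\n100,10\n200,20\n"
tsv_text = "price\tarea\n100\t10\n200\t20\n"

df_csv = pd.read_csv(io.StringIO(csv_text))    # comma is the default separator
df_tsv = pd.read_table(io.StringIO(tsv_text))  # tab is the default separator

# Both calls should produce the same two-row, two-column DataFrame.
print(df_csv.equals(df_tsv))
```

Both functions also accept a `sep` argument, so either one can read a file with a different delimiter.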

$\color{red}{Code:}$ The following code uses Pandas to read a CSV file stored at a specific location on the computer and saves it in a DataFrame object named data.

In [15]:
import pandas as pd

data = pd.read_csv(r'E:\(1)WORK OF HAMZA\VS Code\.vscode\.vscode\python.py\Lab 29-09-22\Housing.csv')

$\color{red}{Code}$: The size of the data set loaded into the DataFrame can be printed using the following code

In [16]:
print("The number of rows in data set:", len(data))
print("The number of columns in data set:", len(data.columns))


The number of rows in data set: 545
The number of columns in data set: 13
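The same two numbers are also available together through the `shape` attribute of a DataFrame. The sketch below uses a small made-up frame named `toy` (a hypothetical stand-in for `data`) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy frame; the tutorial's `data` would be used the same way.
toy = pd.DataFrame({"price": [100, 200, 300], "area": [10, 20, 30]})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print("rows:", n_rows, "columns:", n_cols)
```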


$\color{red}{Code}$: The data types of the features can be printed using the following code

In [17]:
print(data.dtypes)

price                int64
area                 int64
bedrooms             int64
bathrooms            int64
stories              int64
mainroad            object
guestroom           object
basement            object
hotwaterheating     object
airconditioning     object
parking              int64
prefarea            object
furnishingstatus    object
dtype: object


$\color{red}{Code}$: The following code changes the data type of a given feature to a different dtype, for example converting int to float and vice versa, or float to object.


In [18]:
# Convert a feature of the housing data from int64 to float64
data['area'] = data['area'].astype(float)
# Convert it back from float64 to int64
data['area'] = data['area'].astype('int64')
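Type conversion is also useful for qualitative features that Pandas reads as `object`. The sketch below is one possible approach, not the only one: it turns a yes/no column into booleans and a multi-valued column into the `category` dtype. The frame `toy` is a hypothetical stand-in that borrows two column names from the housing data.

```python
import pandas as pd

# Hypothetical toy frame with two qualitative columns.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "unfurnished", "furnished"],
})

# A yes/no column becomes boolean by comparing against "yes".
toy["mainroad"] = toy["mainroad"].eq("yes")

# A column with a few distinct labels can be stored as a categorical.
toy["furnishingstatus"] = toy["furnishingstatus"].astype("category")

print(toy.dtypes)
```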

## 1.2 Summary Statistics

$\color{red}{Code}$: The following code computes basic statistics of the numerical features of the data set: mean, standard deviation, minimum and maximum. For each qualitative attribute, it counts the frequency of each distinct value.


In [19]:
from pandas.api.types import is_numeric_dtype
from pandas.api.types import is_object_dtype
for col in data.columns:
    if is_numeric_dtype(data[col]):
        print('%s:' % (col))
        print('\t Mean = %.2f' % data[col].mean())
        print('\t Standard deviation = %.2f' % data[col].std())
        print('\t Minimum = %.2f' % data[col].min())
        print('\t Maximum = %.2f' % data[col].max())
    if is_object_dtype(data[col]):
        print(data[col].value_counts())
    
    

price:
	 Mean = 4766729.25
	 Standard deviation = 1870439.62
	 Minimum = 1750000.00
	 Maximum = 13300000.00
area:
	 Mean = 5150.54
	 Standard deviation = 2170.14
	 Minimum = 1650.00
	 Maximum = 16200.00
bedrooms:
	 Mean = 2.97
	 Standard deviation = 0.74
	 Minimum = 1.00
	 Maximum = 6.00
bathrooms:
	 Mean = 1.29
	 Standard deviation = 0.50
	 Minimum = 1.00
	 Maximum = 4.00
stories:
	 Mean = 1.81
	 Standard deviation = 0.87
	 Minimum = 1.00
	 Maximum = 4.00
yes    468
no      77
Name: mainroad, dtype: int64
no     448
yes     97
Name: guestroom, dtype: int64
no     354
yes    191
Name: basement, dtype: int64
no     520
yes     25
Name: hotwaterheating, dtype: int64
no     373
yes    172
Name: airconditioning, dtype: int64
parking:
	 Mean = 0.69
	 Standard deviation = 0.86
	 Minimum = 0.00
	 Maximum = 3.00
no     417
yes    128
Name: prefarea, dtype: int64
semi-furnished    227
unfurnished       178
furnished         140
Name: furnishingstatus, dtype: int64
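The same statistics can be collected without an explicit loop. The sketch below is an alternative under stated assumptions: it runs on a small made-up frame named `toy`, selects the numerical columns with `select_dtypes`, and passes a list of statistic names to `agg`.

```python
import pandas as pd

# Hypothetical toy frame mixing numerical and qualitative columns.
toy = pd.DataFrame({
    "area": [10, 20, 30, 40],
    "bedrooms": [1, 2, 2, 3],
    "mainroad": ["yes", "no", "yes", "yes"],
})

# One row per statistic, one column per numerical feature.
numeric_summary = toy.select_dtypes(include="number").agg(["mean", "std", "min", "max"])

# Frequency of each distinct value of a qualitative feature.
category_counts = toy["mainroad"].value_counts()

print(numeric_summary)
print(category_counts)
```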


$\color{red}{Code}$: We can also display the summary for all attributes at once in a table using the describe() function. For a quantitative attribute, it displays the mean, standard deviation and various quantiles (including minimum, median, and maximum). For a qualitative attribute, it displays the number of unique values and the top (most frequent) value together with its frequency.

In [20]:
data.describe(include='all')

| | price | area | bedrooms | bathrooms | stories | mainroad | guestroom | basement | hotwaterheating | airconditioning | parking | prefarea | furnishingstatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 545.0 | 545.0 | 545.0 | 545.0 | 545.0 | 545 | 545 | 545 | 545 | 545 | 545.0 | 545 | 545 |
| unique | | | | | | 2 | 2 | 2 | 2 | 2 | | 2 | 3 |
| top | | | | | | yes | no | no | no | no | | no | semi-furnished |
| freq | | | | | | 468 | 448 | 354 | 520 | 373 | | 417 | 227 |
| mean | 4766729.0 | 5150.541284 | 2.965138 | 1.286239 | 1.805505 | | | | | | 0.693578 | | |
| std | 1870440.0 | 2170.141023 | 0.738064 | 0.50247 | 0.867492 | | | | | | 0.861586 | | |
| min | 1750000.0 | 1650.0 | 1.0 | 1.0 | 1.0 | | | | | | 0.0 | | |
| 25% | 3430000.0 | 3600.0 | 2.0 | 1.0 | 1.0 | | | | | | 0.0 | | |
| 50% | 4340000.0 | 4600.0 | 3.0 | 1.0 | 2.0 | | | | | | 0.0 | | |
| 75% | 5740000.0 | 6360.0 | 3.0 | 2.0 | 2.0 | | | | | | 1.0 | | |


$\color{red}{Code}$: The correlation and covariance among the numerical features can be printed using the following code

 

In [21]:
print('Correlation:')
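# Restrict to numerical columns; recent Pandas versions raise an error on object columns
data.select_dtypes(include='number').corr()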

Correlation:


| | price | area | bedrooms | bathrooms | stories | parking |
|---|---|---|---|---|---|---|
| price | 1.0 | 0.535997 | 0.366494 | 0.517545 | 0.420712 | 0.384394 |
| area | 0.535997 | 1.0 | 0.151858 | 0.19382 | 0.083996 | 0.35298 |
| bedrooms | 0.366494 | 0.151858 | 1.0 | 0.37393 | 0.408564 | 0.13927 |
| bathrooms | 0.517545 | 0.19382 | 0.37393 | 1.0 | 0.326165 | 0.177496 |
| stories | 0.420712 | 0.083996 | 0.408564 | 0.326165 | 1.0 | 0.045547 |
| parking | 0.384394 | 0.35298 | 0.13927 | 0.177496 | 0.045547 | 1.0 |


In [22]:
print('Covariance:')
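# Restrict to numerical columns; recent Pandas versions raise an error on object columns
data.select_dtypes(include='number').cov()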

Covariance:


| | price | area | bedrooms | bathrooms | stories | parking |
|---|---|---|---|---|---|---|
| price | 3498544000000.0 | 2175676000.0 | 505946.425931 | 486409.333378 | 682644.632825 | 619467.324204 |
| area | 2175676000.0 | 4709512.0 | 243.23214 | 211.346617 | 158.129368 | 659.989696 |
| bedrooms | 505946.4 | 243.2321 | 0.544738 | 0.138674 | 0.261589 | 0.088562 |
| bathrooms | 486409.3 | 211.3466 | 0.138674 | 0.252476 | 0.142171 | 0.076842 |
| stories | 682644.6 | 158.1294 | 0.261589 | 0.142171 | 0.752543 | 0.034043 |
| parking | 619467.3 | 659.9897 | 0.088562 | 0.076842 | 0.034043 | 0.74233 |
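The two tables are related: the Pearson correlation is the covariance rescaled by the standard deviations, $\rho_{xy} = \mathrm{cov}(x, y) / (\sigma_x \sigma_y)$, which is why correlation values are unit-free and lie between -1 and 1 while covariance values depend on the scale of the features. The sketch below checks this relation on a small made-up frame named `toy` (hypothetical values, not the housing data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two numerical features.
toy = pd.DataFrame({
    "area": [10.0, 20.0, 30.0, 45.0],
    "price": [12.0, 19.0, 33.0, 41.0],
})

cov = toy.cov()   # sample covariance matrix
std = toy.std()   # sample standard deviation of each column

# Divide each covariance by the product of the two standard deviations.
corr_from_cov = cov / np.outer(std, std)

# Should agree with the correlation matrix computed by Pandas.
print(np.allclose(corr_from_cov, toy.corr()))
```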


## 1.3 Data Visualization

$\color{red}{Code} :$ The distribution of a feature can be visualized with a histogram. For numerical features, a binned histogram is drawn with the hist() function of the Pandas DataFrame. For categorical columns, a bar plot is drawn using value_counts() and plot.bar().


$\color{red}{Code} :$ The following code demonstrates plotting the histogram of a given feature

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# Histogram of a numerical feature of the housing data, split into 8 bins
data['area'].hist(bins=8)
plt.title('Distribution of area')
plt.xlabel('area')
plt.ylabel('Number of data points per bin')
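To see what a histogram counts, the bin edges and the number of points per bin can be computed directly with `numpy.histogram`. This is a sketch on made-up values: like the default behaviour of a histogram plot, it splits the range from the minimum to the maximum into equal-width bins.

```python
import numpy as np
import pandas as pd

# Hypothetical values of a numerical feature.
values = pd.Series([1, 2, 2, 3, 5, 8, 9, 9])

# Split the range [min, max] into 4 equal-width bins and count the points in each.
counts, edges = np.histogram(values, bins=4)

print("bin edges:", edges)
print("counts per bin:", counts)
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.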

$\color{red}{Code} :$ The following code demonstrates plotting the bar plot of a given categorical feature

In [23]:
# Bar plot of the frequency of each category of a qualitative feature
data['furnishingstatus'].value_counts().plot.bar()
plt.title('Frequency distribution of furnishingstatus')
plt.xlabel('furnishingstatus')
plt.ylabel('Number of data points')

$\color{red}{Code}$:  A boxplot can also be used to show the distribution of values for each attribute. The code below shows the boxplot of two selected features



In [None]:
data.boxplot(column=['bedrooms', 'bathrooms'])
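The quantities a boxplot draws can also be computed by hand. The sketch below uses made-up values and assumes the common convention (the Matplotlib default) that whiskers extend at most 1.5 times the interquartile range beyond the quartiles, with points outside that range drawn as outliers.

```python
import pandas as pd

# Hypothetical values of a numerical feature with one extreme point.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 30])

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Limits beyond which points are treated as outliers (1.5 * IQR rule).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("Q1, median, Q3:", q1, median, q3)
print("outliers:", outliers.tolist())
```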

$\color{red}{Code}$: For each pair of attributes, we can use a scatter plot to visualize their joint distribution.

In [None]:
# scatter_matrix lives in pd.plotting; the old top-level pd.scatter_matrix was removed
pd.plotting.scatter_matrix(data, alpha=0.2, figsize=(10, 10))
plt.show()
    

$\color{red}{Code :}$ The following code plots the correlation matrix of the numerical features

In [None]:
plt.matshow(data.select_dtypes(include='number').corr())
plt.colorbar()
plt.show()

# 2. Exercise

1. Download the iris data set from https://archive.ics.uci.edu/ml/datasets/Iris . Read the data set in Python. Analyse all features statistically and interpret the correlation results.

2. Download the Housing data set from https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ . Read the data set in Python. Analyse all features statistically and interpret the correlation results.