Hi, I'm Marco, a university student who wants to improve his knowledge of Machine Learning. I use these notebooks to explain the concepts I am studying so that I can pass them on to others who may find them useful. This notebook covers **Exploratory Data Analysis**.

![](https://www.allbusiness.com/media-library/business-analytics.jpg?id=32092749)

# What's Exploratory Data Analysis?

Exploratory Data Analysis (EDA) **is an approach to analyzing data sets to summarize their main characteristics**, often **using statistical graphics** and other data visualization methods. A statistical model may or may not be used, but EDA is primarily for seeing what the data can tell us beyond formal modeling, and in that respect it contrasts with traditional hypothesis testing. Exploratory data analysis **has been promoted by John Tukey since 1970 to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments**.

![](https://www.amphilsoc.org/sites/default/files/2019-07/03.04.98.00.Photograph%20of%20John%20Tukey%2C%20Elizabeth%20Menzies%2C%20c.%201960_0.jpg)

# 1) Import Libraries

Let's go! Since we are carrying out the analysis in code, in this case Python, **the first step is to import the libraries we need**.

> * **Pandas** provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
> * **Numpy** provides high-performance multidimensional arrays and matrices.
> * **Matplotlib** is a library for data visualization.
> * **Seaborn** is a Python library for making statistical graphics. It builds on top of matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.
> * **Plotly Express** is a high-level interface to Plotly for drawing interactive charts.


In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# 2) Import data and have a first look

To work on the data, **the second step is to import the dataset and take a first look at it**.

In [None]:
df = pd.read_csv("/kaggle/input/housesalesprediction/kc_house_data.csv")
df.head()

Now that we have imported our data, **let's try to understand which features (columns) the dataset is made up of, what their meaning is and what types of data they are.** To get the feature names of a pandas dataframe, you can use the `columns` attribute of the dataframe. The `dtypes` attribute of a DataFrame object returns the data type of each column in the DataFrame. 
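As a minimal sketch of these two attributes, the snippet below builds a tiny stand-in DataFrame and inspects its column names and types. The name `toy` and all of its values are made up for illustration; only the column names come from the dataset. On the real data you would simply look at `df.columns` and `df.dtypes`.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real dataset (all values are invented)
toy = pd.DataFrame({
    "date": ["20140502T000000", "20141127T000000", "20150314T000000"],
    "price": [300000.0, 450000.0, 275000.0],
    "bedrooms": [3, 4, 2],
})

# Column names as a plain Python list
print(toy.columns.tolist())

# Data type of each column (a Series indexed by column name)
print(toy.dtypes)
```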

* **'id'**: Identifier of the house [int64]
* **'date'**: Date of the home sale [object]
* **'price'**: Price of the house [float64]
* **'bedrooms'**: Number of bedrooms [int64]
* **'bathrooms'**: Number of bathrooms [float64]
* **'sqft_living'**: Square feet of living space inside the house [int64]
* **'sqft_lot'**: Square feet of the lot (the land) [int64]
* **'floors'**: Number of floors [float64]
* **'waterfront'**: Binary variable indicating whether the house is on the waterfront [int64]
* **'view'**: Rating of the view from the house (commonly documented as an index from 0 to 4) [int64]
* **'condition'**: Condition rating of the house [int64]
* **'grade'**: Grade rating of the house [int64]
* **'sqft_above'**: Square feet of the house above ground level, i.e. excluding the basement [int64]
* **'sqft_basement'**: Square feet of the basement [int64]
* **'yr_built'**: Year of construction [int64]
* **'yr_renovated'**: Year of renovation [int64]
* **'zipcode'**: Zipcode [int64]
* **'lat'**: Latitude coordinate [float64]
* **'long'**: Longitude coordinate [float64]
* **'sqft_living15'**: Average living space of the 15 nearest neighbours [int64]
* **'sqft_lot15'**: Average lot space of the 15 nearest neighbours [int64]

Now that we have had a first look at the data and know what it represents, let's clean it before extracting any information.

# 3) Cleaning and visualizing data

First, let's start by **checking** whether our table is complete or whether there is **missing data**. To count the missing values in a pandas DataFrame, you can use the `isna()` method to flag each missing entry and then the `sum()` method to count the flagged entries in each column.

In [None]:
df.isna().sum()

As you can see, there are no missing values, so we can work with the dataframe directly. If there had been missing data, we would have had to either fill those "holes" with meaningful values derived from the other data available, or drop the rows or columns containing the missing data, at the cost of losing other data that could have been useful to us.
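The two options can be sketched as follows. This is only an illustration on invented data (the `toy` DataFrame is hypothetical), and filling with the median is just one of several possible strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values (not taken from the real dataset)
toy = pd.DataFrame({
    "bedrooms": [3, 2, np.nan, 4],
    "price": [300000.0, np.nan, 275000.0, 450000.0],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()

# Option 1: fill each gap with the median of its column
filled = toy.fillna(toy.median())

# Option 2: drop every row that contains at least one missing value
dropped = toy.dropna()

print(missing_per_column)
print(filled)
print(len(toy), "rows before dropping,", len(dropped), "rows after")
```

Filling keeps all four rows but introduces estimated values, while dropping keeps only observed values but loses half of this small table.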

Now that we are sure that all the data is there, let's check whether any feature has a data type that doesn't make much sense and should be converted to a more meaningful one. For example, it may seem strange that the number of bathrooms is a decimal. How can the number of bathrooms not be an integer? Well, reading the discussions about the dataset, I found this excellent answer by [@harlfoxem](https://www.kaggle.com/datasets/harlfoxem/housesalesprediction/discussion/24804), so I would say that we can leave the number of bathrooms as it is, even if it is not very common.
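One quick way to see why converting the column to integers would lose information is to count how many values are not whole numbers. The sketch below uses an invented `bathrooms` Series; on the real data the same expression would be applied to `df['bathrooms']`.

```python
import pandas as pd

# Hypothetical bathroom counts (invented values for illustration)
bathrooms = pd.Series([1.0, 2.25, 2.5, 1.75, 3.0], name="bathrooms")

# A value is fractional when its remainder after dividing by 1 is non-zero
is_fractional = bathrooms % 1 != 0

print(is_fractional.sum(), "of", len(bathrooms), "values are not whole numbers")

# Frequency of each distinct value, sorted by the value itself
print(bathrooms.value_counts().sort_index())
```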

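Finally, the last feature we need to fix is the date, which is stored as a compact string (year, month and day run together) and ends with "T000000", which we don't need. So, using the `str.slice()` method, which extracts substrings by position, I create the year, month and day columns so I can analyze them later. I also drop the original `date` column and the `id` column, which we won't use in the analysis.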

In [None]:
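# The date string has the layout "YYYYMMDDT000000": characters 0-3 are the year,
# 4-5 the month and 6-7 the day (the slices return strings, not integers)
df['year'] = df['date'].str.slice(0, 4)
df['month'] = df['date'].str.slice(4, 6)
df['day'] = df['date'].str.slice(6, 8)
# Drop the original date string and the id column
df = df.drop(['date', 'id'], axis=1)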
df.head(3)
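String slicing produces text columns. As an alternative sketch (not what this notebook does), `pd.to_datetime` can parse the same layout into real timestamps, from which the year, month and day come out as integers. The `raw` DataFrame and its dates are invented for illustration.

```python
import pandas as pd

# Hypothetical raw date strings in the same layout as the 'date' column
raw = pd.DataFrame({"date": ["20140502T000000", "20141127T000000", "20150314T000000"]})

# Parse the strings into timestamps; the format spells out the literal "T" and the time part
parsed = pd.to_datetime(raw["date"], format="%Y%m%dT%H%M%S")

# The .dt accessor exposes the date components as integers
raw["year"] = parsed.dt.year
raw["month"] = parsed.dt.month
raw["day"] = parsed.dt.day

print(raw)
```

Integer components sort and plot numerically, which is handy if the columns are later used on an axis or in a model.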

**Let's look at the features of the dataframe individually**

> Number of bedrooms

In [None]:
fig1 = px.histogram(df, x="bedrooms",
                   title='Number of bedrooms',
                   labels={'bedrooms':'N° of bedrooms'})
fig1.show()
df["bedrooms"].describe()

As the statistics show, the data is concentrated on the left side of the graph, yet the x-axis extends far to the right, where there seems to be no data. This happens because **there is one house recorded with 33 bedrooms**, truly incredible!

In [None]:
df.loc[df['bedrooms'] == 33]
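Instead of hard-coding the value 33, the most extreme row can also be located programmatically. The sketch below does this with `idxmax()` on an invented `toy` DataFrame; the values are hypothetical and only mimic the situation above.

```python
import pandas as pd

# Hypothetical data with one extreme value (invented for illustration)
toy = pd.DataFrame({
    "bedrooms": [3, 2, 33, 4],
    "price": [300000.0, 275000.0, 500000.0, 450000.0],
})

# Index label of the row holding the largest number of bedrooms
extreme_label = toy["bedrooms"].idxmax()

# Select that row; the double brackets keep the result as a DataFrame
extreme_row = toy.loc[[extreme_label]]

print(extreme_row)
```

Note that `idxmax()` returns only the first occurrence of the maximum, so the equality filter used above is still the better choice when several rows may share the same extreme value.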

> Number of bathrooms

In [None]:
fig2 = px.histogram(df, x="bathrooms",
                   title='Number of bathrooms',
                   labels={'bathrooms':'N° of bathrooms'})
fig2.show()
df["bathrooms"].describe()

> Square feet of living space

In [None]:
fig3 = px.histogram(df, x="sqft_living",
                   title='Square feet of living space',
                   labels={'sqft_living':'Living space (square feet)'})
fig3.show()
df["sqft_living"].describe()