## __Problem Statement:__

The housing market is complex and can be overwhelming. A data scientist at a real estate company is responsible for analyzing housing data to uncover what drives house prices. The goal is to understand which factors influence prices and how individual house features affect them. This understanding helps the company navigate the housing market more effectively and make well-informed decisions when buying and selling houses.

## __Steps to Perform:__

__1. Understand the structure of the dataset, the types of variables, and any obvious issues in the data__

In [6]:
import pandas as pd

# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df=pd.read_csv("housing_data.csv")
df

Unnamed: 0.1,Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,0,SC60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,No,No,No,0,Feb,2008,WD,Normal,208500
1,1,SC20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,No,No,No,0,May,2007,WD,Normal,181500
2,2,SC60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,No,No,No,0,Sep,2008,WD,Normal,223500
3,3,SC70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,No,No,No,0,Feb,2006,WD,Abnorml,140000
4,4,SC60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,No,No,No,0,Dec,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1455,SC60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,No,No,No,0,Aug,2007,WD,Normal,175000
1456,1456,SC20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,No,MnPrv,No,0,Feb,2010,WD,Normal,210000
1457,1457,SC70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,No,GdPrv,Shed,2500,May,2010,WD,Normal,266500
1458,1458,SC20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,No,No,No,0,Apr,2010,WD,Normal,142125
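
A per-column summary can make the structure easier to read than the raw table. The sketch below is one possible way to do it, not part of the original analysis: the helper names `summarize_structure` and `drop_index_residue` are hypothetical, and the toy frame only imitates a few columns of the dataset. It assumes the `Unnamed: 0` column is a saved row index, which is typical when a CSV was written with its index included.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),
    })


def drop_index_residue(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns whose name starts with 'Unnamed' (assumed to be saved row indexes)."""
    return df.loc[:, ~df.columns.str.startswith("Unnamed")]


# Toy data for illustration only; these rows are not the real dataset
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "MSZoning": ["RL", "RL", "RM"],
    "LotFrontage": [65.0, None, 68.0],
    "SalePrice": [208500, 181500, 223500],
})

clean = drop_index_residue(toy)
summary = summarize_structure(clean)
print(summary)
```

On the real data, `df.info()` and `df.describe()` give a similar overview with less code.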


__2. Check for duplicate entries in the dataset and decide how to handle them__

In [7]:
df.duplicated().sum()

0
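
A result of 0 may be worth a second look: if `Unnamed: 0` is a saved row index, every row is unique by construction and `duplicated()` cannot find anything. The sketch below compares the plain check with one that ignores such columns. The helper `count_duplicates` and the toy rows are assumptions made for illustration.

```python
import pandas as pd


def count_duplicates(df: pd.DataFrame, ignore_prefix: str = "Unnamed") -> int:
    """Count duplicate rows while ignoring index-like columns."""
    cols = [c for c in df.columns if not c.startswith(ignore_prefix)]
    return int(df.duplicated(subset=cols).sum())


# Toy data for illustration only: rows 0 and 2 describe the same house
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "LotArea": [8450, 9600, 8450],
    "SalePrice": [208500, 181500, 208500],
})

naive = int(toy.duplicated().sum())   # the row counter makes every row unique
ignoring = count_duplicates(toy)      # compares only the real features

# One possible handling strategy: keep the first occurrence of each duplicate
deduped = toy.drop_duplicates(subset=["LotArea", "SalePrice"], keep="first")
print(naive, ignoring, len(deduped))
```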

__3. Identify and handle missing values. Decide whether to fill them in or drop them based on the context__

In [8]:
df.isnull()

Unnamed: 0.1,Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1456,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1457,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1458,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
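
The boolean table is hard to scan for 1460 rows, so a count per column is usually more useful. A minimal sketch, assuming a hypothetical helper `missing_report` and toy values that are not the real data:

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain nulls, with count and percentage, worst first."""
    n_missing = df.isnull().sum()
    report = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": 100 * n_missing / len(df),
    })
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "Alley": [None, None, None, None],
    "LotFrontage": [65.0, None, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

report = missing_report(toy)
print(report)
```

On the real frame, `df.isnull().sum()` alone already gives the counts; the percentage helps when deciding whether a column is worth keeping.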


In [9]:
# The "Alley" column contains only null values, so it is dropped.
df = df.drop(columns=["Alley"])
df

Unnamed: 0.1,Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,0,SC60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,...,0,No,No,No,0,Feb,2008,WD,Normal,208500
1,1,SC20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,...,0,No,No,No,0,May,2007,WD,Normal,181500
2,2,SC60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,...,0,No,No,No,0,Sep,2008,WD,Normal,223500
3,3,SC70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,...,0,No,No,No,0,Feb,2006,WD,Abnorml,140000
4,4,SC60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,...,0,No,No,No,0,Dec,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1455,SC60,RL,62.0,7917,Pave,Reg,Lvl,AllPub,Inside,...,0,No,No,No,0,Aug,2007,WD,Normal,175000
1456,1456,SC20,RL,85.0,13175,Pave,Reg,Lvl,AllPub,Inside,...,0,No,MnPrv,No,0,Feb,2010,WD,Normal,210000
1457,1457,SC70,RL,66.0,9042,Pave,Reg,Lvl,AllPub,Inside,...,0,No,GdPrv,Shed,2500,May,2010,WD,Normal,266500
1458,1458,SC20,RL,68.0,9717,Pave,Reg,Lvl,AllPub,Inside,...,0,No,No,No,0,Apr,2010,WD,Normal,142125
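
Dropping is one option; filling is the other. The sketch below shows one common rule of thumb, offered as an assumption rather than the method used in this notebook: drop columns that are almost entirely empty, fill numeric gaps with the median, and fill categorical gaps with the most frequent value. The helper `handle_missing`, the 0.9 threshold, and the toy rows are all hypothetical.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame, drop_threshold: float = 0.9) -> pd.DataFrame:
    """Drop mostly-empty columns, then fill the remaining gaps by column type."""
    out = df.copy()
    missing_share = out.isnull().mean()
    out = out.drop(columns=missing_share[missing_share > drop_threshold].index)
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Median is less sensitive to extreme values than the mean
            out[col] = out[col].fillna(out[col].median())
        else:
            # Most frequent category for non-numeric columns
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy data for illustration only
toy = pd.DataFrame({
    "Alley": [None, None, None, None],
    "LotFrontage": [65.0, None, 68.0, 60.0],
    "MSZoning": ["RL", None, "RL", "RM"],
})

filled = handle_missing(toy)
print(filled)
```

Whether the median or a domain-specific value is appropriate depends on why the value is missing, so the rule should be checked column by column.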


__4. Apply the necessary transformations to the variables. This could include scaling numerical variables or encoding categorical variables__

In [14]:
# StandardScaler standardizes a column to zero mean and unit variance (z-scores).
# The result is not confined to [-1, 1], and the shape of the distribution is unchanged.
from sklearn.preprocessing import StandardScaler

# Many ML algorithms are sensitive to feature scale, so features are standardized before modeling
standard_scaler=StandardScaler()
# Here the 2ndFlrSF column is replaced by its standardized values
df['2ndFlrSF']=standard_scaler.fit_transform(df[['2ndFlrSF']])
df['2ndFlrSF']

0       1.161852
1      -0.795163
2       1.189351
3       0.937276
4       1.617877
          ...   
1455    0.795198
1456   -0.795163
1457    1.844744
1458   -0.795163
1459   -0.795163
Name: 2ndFlrSF, Length: 1460, dtype: float64
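
The printed values go above 1 (for example 1.844744), which is expected for z-scores. The sketch below reproduces the standardization formula by hand with NumPy, using the population standard deviation (`ddof=0`) that `StandardScaler` also uses. The function name `standardize` and the toy square footages are assumptions for illustration.

```python
import numpy as np


def standardize(values) -> np.ndarray:
    """z = (x - mean) / std, with the population standard deviation (ddof=0)."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()


# Toy second-floor areas for illustration only (0 means no second floor)
toy = [0.0, 0.0, 800.0, 1200.0]
z = standardize(toy)
print(z)
```

The output has mean 0 and standard deviation 1, but individual values can lie well outside [-1, 1].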

In [13]:
# MinMaxScaler linearly rescales the data to the range [0, 1]; the shape of the distribution is unchanged
from sklearn.preprocessing import MinMaxScaler
obj=MinMaxScaler()
df['LotFrontage']=obj.fit_transform(df[['LotFrontage']])
df['LotFrontage']

0       0.207668
1       0.255591
2       0.217252
3       0.191693
4       0.268371
          ...   
1455    0.198083
1456    0.271565
1457    0.210863
1458    0.217252
1459    0.239617
Name: LotFrontage, Length: 1460, dtype: float64
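
Min-max scaling is a simple linear map, which can be checked by hand. The sketch below uses a hypothetical helper `min_max` and a toy column whose minimum (0) and maximum (313) are assumptions, chosen because they are consistent with the printed output above (65.0 maps to about 0.207668).

```python
import numpy as np


def min_max(values) -> np.ndarray:
    """x_scaled = (x - min) / (max - min), mapping the column onto [0, 1]."""
    x = np.asarray(values, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


# Toy LotFrontage values for illustration only; 0 and 313 are assumed extremes
toy = [0.0, 65.0, 80.0, 313.0]
scaled = min_max(toy)
print(scaled)
```

Because the map depends only on the minimum and maximum, a single extreme value compresses all other values toward 0.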

__5. For continuous variables, consider creating bins to turn them into categorical variables. For example, you can bin the __YearBuilt__ feature into decades__

In [1]:
import pandas as pd

# Reload the data; read_csv already returns a DataFrame
df=pd.read_csv("housing_data.csv")
df

Unnamed: 0.1,Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,0,SC60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,No,No,No,0,Feb,2008,WD,Normal,208500
1,1,SC20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,No,No,No,0,May,2007,WD,Normal,181500
2,2,SC60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,No,No,No,0,Sep,2008,WD,Normal,223500
3,3,SC70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,No,No,No,0,Feb,2006,WD,Abnorml,140000
4,4,SC60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,No,No,No,0,Dec,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1455,SC60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,No,No,No,0,Aug,2007,WD,Normal,175000
1456,1456,SC20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,No,MnPrv,No,0,Feb,2010,WD,Normal,210000
1457,1457,SC70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,No,GdPrv,Shed,2500,May,2010,WD,Normal,266500
1458,1458,SC20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,No,No,No,0,Apr,2010,WD,Normal,142125


In [5]:
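# Data binning
# Define bin edges and labels (roughly 30-year spans rather than single decades)
bin_edges=[1872,1900,1930,1960,1990,2010]
bin_labels=['1872-1900','1900-1930','1930-1960','1960-1990','1990-2010']
# include_lowest=True keeps a house built in 1872 in the first bin; by default the lowest edge is excluded
df['Decades']=pd.cut(df['YearBuilt'],bins=bin_edges,labels=bin_labels,include_lowest=True)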

df[['YearBuilt','Decades']].head(100)

Unnamed: 0,YearBuilt,Decades
0,2003,1990-2010
1,1976,1960-1990
2,2001,1990-2010
3,1915,1900-1930
4,2000,1990-2010
...,...,...
95,1993,1990-2010
96,1999,1990-2010
97,1965,1960-1990
98,1920,1900-1930
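
The bins used here span about 30 years each. If true decades are wanted, integer division is a simple alternative to `pd.cut`. The sketch below is one possible approach, with a hypothetical helper `decade_label`; the five years are the first five `YearBuilt` values printed above. It also shows how `pd.cut` treats a value equal to the lowest bin edge.

```python
import pandas as pd


def decade_label(years: pd.Series) -> pd.Series:
    """Map a year to its decade label, e.g. 1976 -> '1970s'."""
    decade_start = (years // 10) * 10
    return decade_start.astype(str) + "s"


toy = pd.DataFrame({"YearBuilt": [2003, 1976, 2001, 1915, 2000]})
toy["Decade"] = decade_label(toy["YearBuilt"])
print(toy)

# pd.cut bins are right-closed by default, so the lowest edge itself is left out
edges = [1872, 1900, 1930]
edge_years = pd.Series([1872, 1900])
default_bins = pd.cut(edge_years, bins=edges)
inclusive_bins = pd.cut(edge_years, bins=edges, include_lowest=True)
print(default_bins.isna().tolist(), inclusive_bins.isna().tolist())
```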


__6. Identify outliers in the dataset and decide on a strategy to handle them. You can use a box plot to visualize outliers in features like __LotArea__ or __SalePrice__.__

In [8]:
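import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# sns.load_dataset() only loads seaborn's bundled example datasets, so the local CSV is read with pandas
df=pd.read_csv("housing_data.csv")
fig,axes=plt.subplots(1,2,figsize=(12,4))
sns.boxplot(x=df['LotArea'],ax=axes[0])
sns.boxplot(x=df['SalePrice'],ax=axes[1])
fig.suptitle("Box Plots: LotArea and SalePrice")
plt.show()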
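
A box plot flags points beyond 1.5 times the interquartile range (IQR) from the quartiles, and the same rule can be applied numerically. The sketch below is one possible strategy, not a recommendation for this dataset: it flags outliers with the IQR rule and then clips them to the whisker bounds. The helper `iqr_bounds` is hypothetical, the first five `LotArea` values come from the printed rows above, and the value 200000 is a made-up extreme added for illustration.

```python
import pandas as pd


def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Last value is an invented extreme, not taken from the dataset
lot_area = pd.Series([8450, 9600, 11250, 9550, 14260, 200000], name="LotArea")

lower, upper = iqr_bounds(lot_area)
is_outlier = (lot_area < lower) | (lot_area > upper)

# One handling option: cap extreme values at the bounds instead of dropping rows
clipped = lot_area.clip(lower=lower, upper=upper)
print(int(is_outlier.sum()), float(clipped.max()))
```

Clipping keeps every row, which matters in a dataset of only 1460 houses; dropping or log-transforming are alternatives to weigh against it.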

In [None]:
import pandas as pd
LotArea=