# Housing Prices Exercise

Welcome to a quick exercise for you to practice your pandas skills! We will be using the Housing Prices dataset.

**Import pandas as pd.**

In [3]:
import pandas as pd

**Read housing_prices.csv as a DataFrame called df.**

In [40]:
df = pd.read_csv('housing_prices.csv')

**Check the head of the DataFrame.**

In [28]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,250000
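An `Unnamed: 0` column like the one above usually appears when a CSV was saved together with its index and then read back without `index_col`. The sketch below assumes that is what happened here. The inline CSV text is a stand-in built from a few of the rows shown above, not the real file.

```python
import io
import pandas as pd

# Stand-in for housing_prices.csv (assumption: the file starts with an
# unlabeled index column, which pandas names "Unnamed: 0").
csv_text = """,Id,MSSubClass,LotArea,SalePrice
0,1,60,8450,208500
1,2,20,9600,181500
2,3,60,11250,223500
"""

# Default read: the unlabeled column becomes a regular data column.
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

If the real file has this layout, `pd.read_csv('housing_prices.csv', index_col=0)` would load it without the extra column.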


**Use the .info() method to find out how many entries there are.**

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 11 columns):
Unnamed: 0     1460 non-null int64
Id             1460 non-null int64
MSSubClass     1460 non-null int64
MSZoning       1460 non-null object
LotFrontage    1201 non-null float64
LotArea        1460 non-null int64
Street         1460 non-null object
Alley          91 non-null object
LotShape       1460 non-null object
LandContour    1460 non-null object
SalePrice      1460 non-null int64
dtypes: float64(1), int64(5), object(5)
memory usage: 125.5+ KB


**What is the average LotArea?**

In [21]:
df['LotArea'].mean()

10516.828082191782

**What is the highest SalePrice in the dataset?**

In [23]:
df['SalePrice'].max()

755000

**How many unique Streets are there?**

In [33]:
df['Street'].nunique()

2

**What are the top 2 most common LandContours?**

In [31]:
df['LandContour'].value_counts().head(2)

Lvl    1311
Bnk      63
Name: LandContour, dtype: int64
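Raw counts are easier to judge as shares of the whole: 1311 of 1460 is roughly 90%, and 63 of 1460 is roughly 4%. Here is a minimal sketch of getting shares directly with `normalize=True`. The `toy` Series is made up for illustration and is not the real column.

```python
import pandas as pd

# Toy labels for illustration only (not the real LandContour column).
toy = pd.Series(["Lvl", "Lvl", "Lvl", "Bnk", "Bnk", "HLS"], name="LandContour")

# normalize=True returns fractions of the total instead of raw counts.
top2 = toy.value_counts(normalize=True).head(2)
print(top2)
```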

**On which street is the most expensive house located?**

In [36]:
df.loc[df['SalePrice'] == df['SalePrice'].max(), 'Street']

691    Pave
Name: Street, dtype: object

**On which street is the least expensive house located?**

In [37]:
df.loc[df['SalePrice'] == df['SalePrice'].min(), 'Street']

495    Pave
Name: Street, dtype: object
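The two lookups above can also be written with `idxmax()` and `idxmin()`, which return the index label of the extreme row. This is a sketch on the five rows shown in the `head()` output, not the full dataset, so the row labels differ from the answers above.

```python
import pandas as pd

# Five rows copied from the head() output; not the full dataset.
sample = pd.DataFrame({
    "Street": ["Pave", "Pave", "Pave", "Pave", "Pave"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# idxmax/idxmin give the index label of the first max/min value.
max_label = sample["SalePrice"].idxmax()
min_label = sample["SalePrice"].idxmin()

priciest_street = sample.loc[max_label, "Street"]
cheapest_street = sample.loc[min_label, "Street"]
print(max_label, priciest_street)
print(min_label, cheapest_street)
```

One difference worth keeping in mind: with tied prices, the boolean-mask version returns every tied row, while `idxmax()` and `idxmin()` return only the first.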

**What is the average (mean) SalePrice of all the houses grouped by Street?**

In [38]:
df.groupby('Street')['SalePrice'].mean()

Street
Grvl    130190.500000
Pave    181130.538514
Name: SalePrice, dtype: float64
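A group mean says little without the group size, so it can help to ask for several statistics at once. Below is a minimal sketch using `agg` with a list of function names. The `toy` frame and its prices are invented for illustration.

```python
import pandas as pd

# Toy data for illustration only; these prices are made up.
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Grvl", "Pave"],
    "SalePrice": [200000, 180000, 120000, 140000, 220000],
})

# One row per street, one column per requested statistic.
summary = toy.groupby("Street")["SalePrice"].agg(["mean", "median", "count"])
print(summary)
```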

**Done querying the data? Now let's look at the dataset from a machine learning point of view. Print the info to see which columns are categorical and which are numerical. Columns with dtype 'object' are definitely categorical.**

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 10 columns):
Id             1460 non-null int64
MSSubClass     1460 non-null int64
MSZoning       1460 non-null object
LotFrontage    1201 non-null float64
LotArea        1460 non-null int64
Street         1460 non-null object
Alley          91 non-null object
LotShape       1460 non-null object
LandContour    1460 non-null object
SalePrice      1460 non-null int64
dtypes: float64(1), int64(4), object(5)
memory usage: 114.1+ KB
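Instead of reading the dtypes off the `info()` listing, the column names can be split by dtype with `select_dtypes`. This is a sketch on a few columns from the `head()` output. The variable names are just illustrative.

```python
import pandas as pd

# A few columns from the head() output; not the full dataset.
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "MSSubClass": [60, 20, 60, 70, 60],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL"],
    "LotFrontage": [65.0, 80.0, 68.0, 60.0, 84.0],
    "LotShape": ["Reg", "Reg", "IR1", "IR1", "IR1"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# "number" matches every integer and float dtype.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
# Everything else (here: the text columns) is treated as categorical.
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(non_numeric_cols)
```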


**Print the number of unique values in each column where data-type is not 'object'.**

In [43]:
for col in df.columns:
    if df[col].dtype != 'object':
        print(col, df[col].nunique())

Id 1460
MSSubClass 15
LotFrontage 110
LotArea 1073
SalePrice 663
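The loop above can be collapsed into a single expression, since `nunique()` also works on a whole DataFrame. This is a sketch on a few rows from the `head()` output, so the counts are much smaller than the real ones.

```python
import pandas as pd

# A few columns from the head() output; not the full dataset.
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "MSSubClass": [60, 20, 60, 70, 60],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Keep only numeric columns, then count distinct values per column.
numeric_unique = sample.select_dtypes(include="number").nunique()
print(numeric_unique)
```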


**If the number of unique values in any column seems low, print all unique values in that column. Decide whether the column is categorical or not.**

In [44]:
df['MSSubClass'].value_counts()

20     536
60     299
50     144
120     87
30      69
160     63
70      60
80      58
90      52
190     30
85      20
75      16
45      12
180     10
40       4
Name: MSSubClass, dtype: int64
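With only 15 distinct values, `MSSubClass` looks like a set of codes rather than a measured quantity. If you decide to treat it as categorical, one option is to convert it to the `category` dtype. The sketch below uses the five values from the `head()` output.

```python
import pandas as pd

# MSSubClass values from the head() output; not the full column.
sample = pd.DataFrame({"MSSubClass": [60, 20, 60, 70, 60]})

# Convert the integer codes to a categorical dtype.
sample["MSSubClass"] = sample["MSSubClass"].astype("category")

print(sample["MSSubClass"].dtype)
print(sample["MSSubClass"].cat.categories.tolist())
```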

**Find out the number of NaNs in each column.**

In [41]:
df.isnull().sum()

Id                0
MSSubClass        0
MSZoning          0
LotFrontage     259
LotArea           0
Street            0
Alley          1369
LotShape          0
LandContour       0
SalePrice         0
dtype: int64

**Drop a column if it has a large number of NaNs (over half of the total instances).**
**Print the head to ensure it's been dropped.**

In [39]:
df.drop(['Alley'], axis=1, inplace=True)
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,SalePrice
0,1,60,RL,65.0,8450,Pave,Reg,Lvl,208500
1,2,20,RL,80.0,9600,Pave,Reg,Lvl,181500
2,3,60,RL,68.0,11250,Pave,IR1,Lvl,223500
3,4,70,RL,60.0,9550,Pave,IR1,Lvl,140000
4,5,60,RL,84.0,14260,Pave,IR1,Lvl,250000
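Naming `Alley` by hand works for one column. The "over half" rule can also be applied programmatically, as in this sketch. The `toy` frame, its NaN positions, and the `threshold` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the NaN positions are made up.
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "SalePrice": [208500, 181500, 223500, 140000],
})

threshold = 0.5  # drop columns where more than half the rows are NaN

# isnull().mean() gives the fraction of NaNs in each column.
nan_fraction = toy.isnull().mean()
to_drop = nan_fraction[nan_fraction > threshold].index.tolist()

cleaned = toy.drop(columns=to_drop)
print(to_drop)
print(cleaned.columns.tolist())
```

In the real data, `Alley` has 1369 NaNs out of 1460 rows (about 94%) and `LotFrontage` has 259 (about 18%), so the same rule would drop only `Alley`.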


**Fill the NaNs in any column that still has them with the mode of that column.**

In [45]:
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mode()[0])
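The cell above handles a single column. When several columns still contain NaNs, the per-column modes can be computed once and passed to `fillna` together. This is a minimal sketch on a made-up `toy` frame.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "LotFrontage": [65.0, 80.0, np.nan, 65.0],
    "MSZoning": ["RL", None, "RL", "RM"],
})

# DataFrame.mode() returns one row per tied mode; iloc[0] takes the first,
# giving a Series that maps each column name to its mode.
modes = toy.mode().iloc[0]

# Passing that Series to fillna fills each column with its own mode.
filled = toy.fillna(modes)
print(filled)
```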