In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
sb.set()

## Problem 1 - House Prices

Dataset from the Kaggle competition **"House Prices - Advanced Regression Techniques"**
Source: https://www.kaggle.com/c/house-prices-advanced-regression-techniques (requires login). The underlying Ames Housing data was compiled by Dean De Cock.

### a) Import the train.csv data

In [None]:
hprices = pd.read_csv('train.csv')
hprices.sample(5)

#### Data fields
Here's a brief version of what you'll find in the data description file.

>SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.

>MSSubClass: The building class

>MSZoning: The general zoning classification

>LotFrontage: Linear feet of street connected to property

>LotArea: Lot size in square feet

>Street: Type of road access

>Alley: Type of alley access

>LotShape: General shape of property

>LandContour: Flatness of the property

>Utilities: Type of utilities available

>LotConfig: Lot configuration

>LandSlope: Slope of property

>Neighborhood: Physical locations within Ames city limits

>Condition1: Proximity to main road or railroad

>Condition2: Proximity to main road or railroad (if a second is present)

>BldgType: Type of dwelling

>HouseStyle: Style of dwelling

>OverallQual: Overall material and finish quality

>OverallCond: Overall condition rating

>YearBuilt: Original construction date

>YearRemodAdd: Remodel date

>RoofStyle: Type of roof

>RoofMatl: Roof material

>Exterior1st: Exterior covering on house

>Exterior2nd: Exterior covering on house (if more than one material)

>MasVnrType: Masonry veneer type

>MasVnrArea: Masonry veneer area in square feet

>ExterQual: Exterior material quality

>ExterCond: Present condition of the material on the exterior

>Foundation: Type of foundation

>BsmtQual: Height of the basement

>BsmtCond: General condition of the basement

>BsmtExposure: Walkout or garden level basement walls

>BsmtFinType1: Quality of basement finished area

>BsmtFinSF1: Type 1 finished square feet

>BsmtFinType2: Quality of second finished area (if present)

>BsmtFinSF2: Type 2 finished square feet

>BsmtUnfSF: Unfinished square feet of basement area

>TotalBsmtSF: Total square feet of basement area

>Heating: Type of heating

>HeatingQC: Heating quality and condition

>CentralAir: Central air conditioning

>Electrical: Electrical system

>1stFlrSF: First Floor square feet

>2ndFlrSF: Second floor square feet

>LowQualFinSF: Low quality finished square feet (all floors)

>GrLivArea: Above grade (ground) living area square feet

>BsmtFullBath: Basement full bathrooms

>BsmtHalfBath: Basement half bathrooms

>FullBath: Full bathrooms above grade

>HalfBath: Half baths above grade

>Bedroom: Number of bedrooms above basement level

>Kitchen: Number of kitchens

>KitchenQual: Kitchen quality

>TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

>Functional: Home functionality rating

>Fireplaces: Number of fireplaces

>FireplaceQu: Fireplace quality

>GarageType: Garage location

>GarageYrBlt: Year garage was built

>GarageFinish: Interior finish of the garage

>GarageCars: Size of garage in car capacity

>GarageArea: Size of garage in square feet

>GarageQual: Garage quality

>GarageCond: Garage condition

>PavedDrive: Paved driveway

>WoodDeckSF: Wood deck area in square feet

>OpenPorchSF: Open porch area in square feet

>EnclosedPorch: Enclosed porch area in square feet

>3SsnPorch: Three season porch area in square feet

>ScreenPorch: Screen porch area in square feet

>PoolArea: Pool area in square feet

>PoolQC: Pool quality

>Fence: Fence quality

>MiscFeature: Miscellaneous feature not covered in other categories

>MiscVal: Value of miscellaneous feature in dollars

>MoSold: Month Sold

>YrSold: Year Sold

>SaleType: Type of sale

>SaleCondition: Condition of sale

### b) How many observations (rows) and variables (columns)?

In [None]:
print("Data dimensions : ", hprices.shape)
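
`shape` is a plain `(rows, columns)` tuple, so it can be unpacked directly. A minimal sketch on a small made-up frame (`demo` is a hypothetical stand-in for `hprices`, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for hprices; any DataFrame behaves the same way.
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Pave"],
    "SalePrice": [208500, 181500, 223500],
})

# shape is a (rows, columns) tuple, so it unpacks into two integers.
n_rows, n_cols = demo.shape
print(f"{n_rows} observations (rows), {n_cols} variables (columns)")
```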

### c) What are the data types of the variables?

In [None]:
pd.set_option('display.max_rows', 81)
print(hprices.dtypes)
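
With 81 columns, the full `dtypes` listing is long. One possible way to summarise it is to count columns per dtype and to split numeric from non-numeric columns. A sketch on a small hypothetical frame (`demo` is made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for hprices with one int, one float and one text column.
demo = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, 80.0, None],
    "Street": ["Pave", "Pave", "Grvl"],
})

# How many columns are there of each dtype?
print(demo.dtypes.value_counts())

# Split the column names into numeric and non-numeric groups.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()
print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```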

### d) What does the .info() method do?

In [None]:
hprices.info()
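
`.info()` prints a summary (index range, column names, non-null counts, dtypes, memory usage) and returns `None`. The sketch below, on a hypothetical `demo` frame, captures that printout in a buffer and compares it with explicit missing-value counts:

```python
import io
import pandas as pd

# Hypothetical stand-in for hprices with some missing values.
demo = pd.DataFrame({
    "LotFrontage": [65.0, None, 68.0],
    "Alley": [None, None, "Grvl"],
})

# info() writes its summary to `buf` (stdout by default) and returns None.
buffer = io.StringIO()
result = demo.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# info() reports non-null counts; isna().sum() gives the complementary missing counts.
missing = demo.isna().sum()
print(missing)
```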

### e) What does the .describe() method do?

In [None]:
hprices.describe()
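
By default `.describe()` summarises only the numeric columns (count, mean, std, min, quartiles, max). Passing `include="all"` adds `unique`, `top` and `freq` for non-numeric columns. A sketch on a hypothetical `demo` frame:

```python
import pandas as pd

# Hypothetical stand-in for hprices: one numeric and one text column.
demo = pd.DataFrame({
    "SalePrice": [100000, 200000, 300000],
    "Street": ["Pave", "Pave", "Grvl"],
})

# Default: numeric columns only.
numeric_summary = demo.describe()
print(numeric_summary)

# include="all": every column, with NaN where a statistic does not apply.
full_summary = demo.describe(include="all")
print(full_summary)
```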

---
## Problem 2 - 2016 Summer Olympics

Data imported from https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table

### a) Import the Wikipedia page

In [None]:
wiki = pd.read_html('https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table')
print("Data type : ", type(wiki))

### b) How many tables are in this Wikipedia page?

In [None]:
print(len(wiki), "tables")

### c) Which one is the actual 2016 Summer Olympics medal table?

In [None]:
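# Only the last expression in a cell is displayed, so print the preview explicitly.
# Index 1 (the second table) holds the medal table; re-check if the page layout changes.
print(wiki[1].head())
print("The second table (index 1) is the medal table")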
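
The position of the medal table depends on the page layout, so a hard-coded index can break when the page is edited. One possible alternative is to look for the table by its column names. The helper `find_medal_table` below is hypothetical, and it assumes the medal table has `Gold`, `Silver` and `Bronze` columns. The `tables` list is a made-up stand-in for the list returned by `pd.read_html`:

```python
import pandas as pd

def find_medal_table(tables, required=("Gold", "Silver", "Bronze")):
    """Return (index, table) for the first DataFrame whose columns include all of `required`."""
    for idx, table in enumerate(tables):
        labels = {str(col) for col in table.columns}
        if set(required) <= labels:
            return idx, table
    raise ValueError("no table with the required columns was found")

# Illustrative stand-ins for the list returned by pd.read_html.
tables = [
    pd.DataFrame({0: ["Location"], 1: ["Rio de Janeiro"]}),
    pd.DataFrame({
        "Rank": [1], "NOC": ["United States"],
        "Gold": [46], "Silver": [37], "Bronze": [38], "Total": [121],
    }),
]

idx, medal_table = find_medal_table(tables)
print("medal table found at index", idx)
```

If several tables share these columns, the helper returns the first one, so the result should still be checked by eye.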

### d) Extract the main table and store it as a new Pandas DataFrame

In [None]:
medals = wiki[1]
medals.head()

### e) Extract the TOP 20 countries from the medal table as a new DataFrame

In [None]:
top20 = medals.head(20)
print(top20)
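
`head(20)` relies on the table already being ordered by rank. If that ordering cannot be assumed, one option is to sort explicitly before taking the top rows. A sketch on made-up data (the `NOC`, `Gold`, `Silver` and `Bronze` column names are assumptions about the table layout):

```python
import pandas as pd

# Made-up, deliberately unsorted medal counts.
demo = pd.DataFrame({
    "NOC": ["A", "B", "C"],
    "Gold": [3, 3, 1],
    "Silver": [1, 2, 5],
    "Bronze": [0, 1, 5],
})

# Rank by gold, then silver, then bronze, and keep the first n rows.
n = 2
top_n = (
    demo.sort_values(["Gold", "Silver", "Bronze"], ascending=False)
        .head(n)
        .reset_index(drop=True)
)
print(top_n)
```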

---
## Bonus Problem A

### "Census Income" dataset
The 'adult.data' file was downloaded from the UCI Machine Learning Repository (under “Data Folder”):
https://archive.ics.uci.edu/ml/datasets/Census+Income

In [None]:
# adult.data has no header row, so pandas numbers the 15 columns 0 to 14
income = pd.read_csv('adult.data', header=None)
income.head()

In [None]:
income.shape

In [None]:
income.info()

In [None]:
income.describe()
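
Because the file has no header row, the columns show up as numbers. The sketch below assigns the attribute names listed in the dataset's `adult.names` file (the name `income` for the last column is a common convention, used here as an assumption). It also assumes fields are separated by a comma plus a space and that missing values are written as `?`. The two rows are illustrative samples in the file's format, read from a string so the example runs without the file:

```python
import io
import pandas as pd

# Attribute names as listed in adult.names; "income" for the target is an assumed label.
columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

# Illustrative rows in the same comma-plus-space layout as adult.data.
sample = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

income_demo = pd.read_csv(
    sample,
    header=None,            # no header row in the file
    names=columns,          # use the descriptive names instead of 0..14
    skipinitialspace=True,  # drop the space that follows each comma
    na_values="?",          # treat "?" as missing
)
print(income_demo.dtypes)
```

The same keyword arguments could be passed when reading `'adult.data'` itself.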

---
## Bonus Problem B

### Summer Olympics

Medal tables for other Summer Olympics can be retrieved by changing the year in the URL https://en.wikipedia.org/wiki/2016_Summer_Olympics_medal_table

#### For example, year 2012 medals:

In [None]:
olp2012 = pd.read_html('https://en.wikipedia.org/wiki/2012_Summer_Olympics_medal_table')[1]
olp2012.head()

#### Loop to extract main tables from years 2000 - 2016
i.e. 2000, 2004, 2008, 2012, 2016

In [None]:
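olp = []
# Summer Olympics from 2000 to 2016, every four years
for year in range(2000, 2017, 4):
    url = f'https://en.wikipedia.org/wiki/{year}_Summer_Olympics_medal_table'
    olp.append(pd.read_html(url)[1])  # index 1 is assumed to be the medal table on every page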
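
A list loses track of which table belongs to which year. One possible variation keeps the tables in a dictionary keyed by year. The helper `load_medal_tables` and its `reader` parameter are hypothetical. `reader` defaults to `pd.read_html`, and the sketch swaps in a fake reader so it runs without network access:

```python
import pandas as pd

URL_TEMPLATE = "https://en.wikipedia.org/wiki/{year}_Summer_Olympics_medal_table"

def load_medal_tables(years, table_index=1, reader=pd.read_html):
    """Return {year: DataFrame}, taking the table at `table_index` from each page."""
    return {
        year: reader(URL_TEMPLATE.format(year=year))[table_index]
        for year in years
    }

def fake_reader(url):
    """Offline stand-in for pd.read_html: returns two tiny made-up tables per URL."""
    return [
        pd.DataFrame({"info": [url]}),
        pd.DataFrame({"NOC": ["demo"], "Gold": [1]}),
    ]

olp_by_year = load_medal_tables(range(2000, 2017, 4), reader=fake_reader)
print(list(olp_by_year))
```

Calling `load_medal_tables(range(2000, 2017, 4))` without `reader` would fetch the live pages, under the same assumption that index 1 is the medal table on each of them.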

#### Extract the TOP 20 countries from each medal table
Store them as new DataFrames.

In [None]:
# One top-20 DataFrame per Olympics, in the same order as olp
TOP20 = [table.head(20) for table in olp]

In [None]:
for table in TOP20:
    print(table)
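
To compare years side by side, the separate top-20 tables could also be stacked into a single DataFrame with a `Year` column. A sketch on two tiny made-up tables (`years` and `tables` stand in for the real years and the `TOP20` list):

```python
import pandas as pd

# Made-up stand-ins for two entries of TOP20 and their years.
years = [2012, 2016]
tables = [
    pd.DataFrame({"NOC": ["A", "B"], "Gold": [3, 2]}),
    pd.DataFrame({"NOC": ["B", "A"], "Gold": [5, 4]}),
]

# Tag each table with its year, then stack them into one long DataFrame.
combined = pd.concat(
    [table.assign(Year=year) for year, table in zip(years, tables)],
    ignore_index=True,
)
print(combined)
```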