# Python for Data Analysis II

## Individual assignment - Kaumu Joshi - IE MBD T2

## Sections
* [1. Part (regular expressions)](#0)
* [2. Part (plotly)](#1)  

<a id='0'></a>
## 1. Part (regular expressions)

The goal is to extract dates of different formats from medical data.
We need to correctly identify all of the date variants encoded in this dataset, then standardize and sort the dates. Each line of the file corresponds to a medical note. Each note contains one date that needs to be extracted, but each date is encoded in one of many formats.

### 1.1. Data loading & Import Libraries

In [1]:
# Read the data
with open("./medical_dataset.txt") as f:
    lines = f.readlines()
    
# Import required libraries
import pandas as pd
import re
pd.options.display.max_rows = None

### 1.2. String vectorization

In [2]:
# Transform the data into a pandas dataframe
df = pd.DataFrame(lines, columns=["text"])

#Check number of rows
len(df)

445

### 1.3. Identify Date Formats

In [3]:
df.head()

| | text |
|---|---|
| 0 | 03/25/93 Total time of visit (in minutes):\n |
| 1 | 6/18/85 Primary Care Doctor:\n |
| 2 | sshe plans to move as of 7/8/71 In-Home Servic... |
| 3 | 7 on 9/27/75 Audit C Score Current:\n |
| 4 | 2/6/96 sleep studyPain Treatment Pain Level (N... |


#### Types of formats:
1. mm/dd/yy or mm/dd/yyyy 
2. mm/yyyy (without the day)
3. Day, followed by month in letters and year: e.g. 12. September 2022 
4. Month in letters, followed by day and year: e.g. Sep. 12 2022 
5. Month in letters without day, followed by year: e.g. September 2022 

### 1.3. Design Regular Expression

In [4]:
# Create one regular expression for each type of format listed above
# (inside a character class, "|" is a literal pipe, so [,.] is used instead of [,|.])
expr_1 = r'(?:(\d{1,2}[-/]\d{1,2}[-/](?:\d{4}|\d{2})|'  
expr_2 = r'\d{1,2}[-/]\d{4}|' 
expr_3 = r'\d{1,2}[.,]? [A-Z][a-z]{2,}[,.]? (?:\d{4}|\d{2})|' 
expr_4 = r'[A-Z][a-z]{2,}[,.]?\s\d{1,2}[,.]?\s(?:\d{4}|\d{2})|' 
expr_5 = r'[A-Z][a-z]{2,}[,.]? \d{4}'

# Merge all expressions into one
final_expr = expr_1 + expr_2 + expr_3 + expr_4 + expr_5 + r'))'

final_expr

'(?:(\\d{1,2}[-/]\\d{1,2}[-/](?:\\d{4}|\\d{2})|\\d{1,2}[-/]\\d{4}|\\d{1,2}[.,]? [A-Z][a-z]{2,}[,.]? (?:\\d{4}|\\d{2})|[A-Z][a-z]{2,}[,.]?\\s\\d{1,2}[,.]?\\s(?:\\d{4}|\\d{2})|[A-Z][a-z]{2,}[,.]? \\d{4}))'
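As a quick sanity check, the combined pattern can be tried on one sample per format. The sketch below restates the five alternatives with plain `[,.]` character classes (a `|` inside square brackets is matched literally and does not act as alternation). The sample notes are made up for illustration and are not taken from the dataset.

```python
import re

# Assumption: this restates the five alternatives of final_expr as one pattern.
DATE_PATTERN = re.compile(
    r'(\d{1,2}[-/]\d{1,2}[-/](?:\d{4}|\d{2})'               # 1. mm/dd/yy or mm/dd/yyyy
    r'|\d{1,2}[-/]\d{4}'                                    # 2. mm/yyyy
    r'|\d{1,2}[.,]? [A-Z][a-z]{2,}[,.]? (?:\d{4}|\d{2})'    # 3. day, month name, year
    r'|[A-Z][a-z]{2,}[,.]?\s\d{1,2}[,.]?\s(?:\d{4}|\d{2})'  # 4. month name, day, year
    r'|[A-Z][a-z]{2,}[,.]? \d{4})'                          # 5. month name, year
)

# Hypothetical notes, one per format
samples = [
    "03/25/93 Total time of visit",
    "last seen 9/2009 for follow-up",
    "admitted on 12. September 2022",
    "Sep. 12 2022 sleep study",
    "symptoms since September 2022",
]

# Extract the first date found in each note
found = [DATE_PATTERN.search(text).group(1) for text in samples]
for text, date in zip(samples, found):
    print(f"{date!r:25} <- {text!r}")
```

The order of the alternatives matters: the more specific formats come first, so that a note such as "Sep. 12 2022" is not cut short by the month-and-year alternative.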

### 1.4. Transform Dataframe

In [5]:
# Extract the date as new column from the text by searching for the regexp
df["date"] = df["text"].str.extract(final_expr, expand = True)

# Show the new dataframe
df.head()

| | text | date |
|---|---|---|
| 0 | 03/25/93 Total time of visit (in minutes):\n | 03/25/93 |
| 1 | 6/18/85 Primary Care Doctor:\n | 6/18/85 |
| 2 | sshe plans to move as of 7/8/71 In-Home Servic... | 7/8/71 |
| 3 | 7 on 9/27/75 Audit C Score Current:\n | 9/27/75 |
| 4 | 2/6/96 sleep studyPain Treatment Pain Level (N... | 2/6/96 |


In [6]:
# Remove symbols (like . or ,) from our date values
df['date'] = df.date.str.translate({ord(i): None for i in '.,'}) 

# Split the date column into month, day and year, remove all other signs like / and - 
df[['month','day','year']]  = df.date.str.split(r'[ /-]', expand = True) 

df = df.drop(['date'],axis = 1)

df.head()

| | text | month | day | year |
|---|---|---|---|---|
| 0 | 03/25/93 Total time of visit (in minutes):\n | 3 | 25 | 93 |
| 1 | 6/18/85 Primary Care Doctor:\n | 6 | 18 | 85 |
| 2 | sshe plans to move as of 7/8/71 In-Home Servic... | 7 | 8 | 71 |
| 3 | 7 on 9/27/75 Audit C Score Current:\n | 9 | 27 | 75 |
| 4 | 2/6/96 sleep studyPain Treatment Pain Level (N... | 2 | 6 | 96 |
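The split is purely positional, so the three new columns only carry the right meaning for the mm/dd/yy format. A small sketch on made-up date strings (not rows from the dataset) shows what ends up where for the other formats and why the values have to be reassigned afterwards:

```python
import pandas as pd

# Hypothetical date strings, one per format, after removing "." and ","
dates = pd.Series(["03/25/93", "9/2009", "12 September 2022", "September 2022"])

# Same positional split as above: space, "/" or "-" as separators
parts = dates.str.split(r"[ /-]", expand=True)
parts.columns = ["month", "day", "year"]

print(parts)
```

For "9/2009" the year lands in the `day` column and `year` is left empty; for "12 September 2022" the month name lands in `day`.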


### 1.5. Assign date values to correct column

#### Tips

* Assume all dates where the year is encoded in only two digits are years from the 1900s (e.g. 1/5/89 is January 5th, 1989)
* If the day is missing (e.g. 9/2009), assume it is the first day of the month (e.g. September 1, 2009).
* If the month is missing (e.g. 2010), assume it is the first of January of that year (e.g. January 1, 2010).
* There could be potential typos as this is a raw, real-life derived dataset.
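The tips above can be sketched as a single helper that works on one already-sorted (month, day, year) triple. The function name `normalize_parts` and the `MONTHS` lookup are hypothetical and only serve as an illustration; the notebook itself applies the same rules row by row in the cells below.

```python
# Lookup of month names by their first three letters (hypothetical helper table)
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}


def normalize_parts(month, day, year):
    """Return (year, month, day) as integers, following the tips above."""
    month = str(month)
    # Month names are decoded through their first three letters
    month_num = MONTHS[month[:3].lower()] if month.isalpha() else int(month)
    # A missing day defaults to the first day of the month
    day_num = 1 if day is None else int(day)
    # Two-digit years are assumed to be in the 1900s
    year = str(year)
    year_num = int("19" + year) if len(year) == 2 else int(year)
    return year_num, month_num, day_num


print(normalize_parts("1", "5", "89"))       # two-digit year
print(normalize_parts("9", None, "2009"))    # missing day
print(normalize_parts("Sep", "12", "2022"))  # month name
```

Looking months up by their first three letters also tolerates typos later in the word (for example a misspelled "Decemeber"), which helps with the last tip.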

In [7]:
# Assign the day, month and year into their correct respective columns

for i in range(len(df)):
    # check if day has 4 digit number, if so switch it with the value of year column
    if df.loc[i,'day'].isdigit() and len(df.loc[i,'day']) == 4:  
        df.loc[i, ['day', 'year']] = df.loc[i, ['year', 'day']].values 
    # check if day has letters and is not "None", if so swap with the value of the month column
    if df.loc[i,'day'] is not None and df.loc[i,'day'].isalpha():    
        df.loc[i, ['day', 'month']] = df.loc[i, ['month', 'day']].values 

### 1.6. Correct Formatting

In [8]:
# Uniform formatting of month, day and year

# Create a dictionary to decode month names into numbers
months_dict = { "jan" : 1,
       "feb" : 2,
       "mar" : 3,
       "apr" : 4,
       "may" : 5,
       "jun" : 6,
       "jul" : 7,
       "aug" : 8,
       "sep" : 9,
       "oct" : 10,
       "nov" : 11,
       "dec" : 12}  

for i in range(len(df)):
    # replace missing days with 1 as default
    if df.loc[i,'day'] is None: 
        df.loc[i,'day'] = 1
    # if year has only 2 digits, assume it belongs to the 1900s
    if len(df.loc[i,'year']) == 2:
        df.loc[i,'year'] = '19' + df.loc[i,'year']
    # convert month name to number using the dictionary months_dict
    if df.loc[i,'month'].isalpha():
        df.loc[i,'month'] = months_dict[df.loc[i,'month'][:3].lower()] 

# Convert datatypes of month, day and year columns to integer
cols = ['month', 'day', 'year']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce', axis=1) 

print(df.dtypes)

df.head()

text     object
month     int64
day       int64
year      int64
dtype: object


| | text | month | day | year |
|---|---|---|---|---|
| 0 | 03/25/93 Total time of visit (in minutes):\n | 3 | 25 | 1993 |
| 1 | 6/18/85 Primary Care Doctor:\n | 6 | 18 | 1985 |
| 2 | sshe plans to move as of 7/8/71 In-Home Servic... | 7 | 8 | 1971 |
| 3 | 7 on 9/27/75 Audit C Score Current:\n | 9 | 27 | 1975 |
| 4 | 2/6/96 sleep studyPain Treatment Pain Level (N... | 2 | 6 | 1996 |
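With integer `month`, `day` and `year` columns, the dates can also be combined into a single datetime column and sorted, which is the last step named in the goal. This is a minimal sketch on a small made-up frame, not the notebook's own output; `pd.to_datetime` accepts a DataFrame whose columns are named `year`, `month` and `day`.

```python
import pandas as pd

# Hypothetical rows with the same column layout as the processed dataframe
parts = pd.DataFrame({
    "month": [3, 9, 6],
    "day": [25, 1, 18],
    "year": [1993, 2009, 1985],
})

# Assemble one datetime per row from the year, month and day columns
parts["date"] = pd.to_datetime(parts[["year", "month", "day"]])

# Sort chronologically
ordered = parts.sort_values("date").reset_index(drop=True)
print(ordered)
```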


### 1.7. Saving the file

In [9]:
# Save to a new excel file with name "processed_dates.xlsx"
df.to_excel("processed_dates.xlsx")

<a id='1'></a>
## 2. Part (plotly)

### 2.1. Data loading & Import Libraries

In [4]:
import plotly.offline as py
import plotly.graph_objs as go
import plotly.express as px
import pandas as pd
import matplotlib.pyplot as plt
py.init_notebook_mode(connected=True) # this allows to display plotly graphs in Jupyter

# Read dataset into a pandas dataframe and show its first rows
df = pd.read_csv("./dataset_housing.csv")
df.head()

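| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |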


### 2.2. Is there any relation between neighborhood and price?

In [5]:
# Sort dataframe by increasing housing price
df = df.sort_values('SalePrice')
# Plot Neighborhoods against their distribution of housing price using box plot
fig = px.box(df,y="SalePrice",  x="Neighborhood", title="Distribution of housing prices of neighborhoods"
            , color_discrete_sequence=["orange"], orientation = 'v')

fig.show()

print("There is a relationship between Neighborhoods and Price, you can see that some neighborhoods are more expensive than others")

There is a relationship between Neighborhoods and Price, you can see that some neighborhoods are more expensive than others
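The box plot can be backed up with a per-neighborhood summary. The sketch below uses a tiny made-up frame with the same column names as the housing dataset (the neighborhood labels and prices are invented) and ranks neighborhoods by median sale price:

```python
import pandas as pd

# Hypothetical data: three neighborhoods with two sales each
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "SalePrice": [140000, 150000, 300000, 320000, 110000, 125000],
})

# Median price per neighborhood, most expensive first
medians = (
    toy.groupby("Neighborhood")["SalePrice"]
    .median()
    .sort_values(ascending=False)
)
print(medians)
```

The median is used here because it is the line drawn inside each box and is less sensitive to a few very expensive houses than the mean.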


### 2.3. Is there any relation between neighborhood and year built?



In [6]:
# Plot Neighborhoods against the distribution of the building years of their properties using strip plot
fig = px.strip(df, x="Neighborhood", y="YearBuilt",color_discrete_sequence=["green"] 
               ,title="Building year of properties across different neighborhoods")

py.iplot(fig)


print("There is a relationship between Neighborhoods and Year built, you can see that some neighborhoods have newer properties than others.")
print("Comparing this data with the relation with price, the neighborhoods with newer properties also are the more expensive ones.")

There is a relationship between Neighborhoods and Year built, you can see that some neighborhoods have newer properties than others.
Comparing this data with the relation with price, the neighborhoods with newer properties also are the more expensive ones.
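The comparison between building year and price can be made more concrete by summarizing both per neighborhood and correlating the summaries. This is only a sketch on invented values with the dataset's column names; on the real data the strength of the relationship may differ.

```python
import pandas as pd

# Hypothetical data: older neighborhood A, newer neighborhood C
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "C", "C"],
    "YearBuilt": [1950, 1960, 1980, 1990, 2005, 2007],
    "SalePrice": [100000, 120000, 180000, 200000, 300000, 320000],
})

# One row per neighborhood: median building year and median price
summary = toy.groupby("Neighborhood")[["YearBuilt", "SalePrice"]].median()

# Pearson correlation between the two neighborhood-level medians
r = summary["YearBuilt"].corr(summary["SalePrice"])
print(summary)
print(f"correlation of medians: {r:.2f}")
```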


### 2.3. How do overall quality, lot area, year built and price interact with each other?


#### 2.3.1 Check the distribution


In [7]:
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=3)

fig.add_trace(
    go.Histogram(x=df["LotArea"], hovertemplate="<b>Bin Edges:</b> %{x}<br><b>Count:</b> %{y}<extra></extra>"),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x=df["YearBuilt"], hovertemplate="<b>Bin Edges:</b> %{x}<br><b>Count:</b> %{y}<extra></extra>"),
    row=1, col=2
)

fig.add_trace(
    go.Histogram(x=df["OverallQual"], hovertemplate="<b>Bin Edges:</b> %{x}<br><b>Count:</b> %{y}<extra></extra>"),
    row=1, col=3
)

fig.update_xaxes(title_font_size=16, tickfont_size=16)
fig.update_yaxes(title_font_size=16, tickfont_size=16)
fig.update_layout(
    title_text="Distribution of all three features",
    xaxis1_title_text="LotArea",
    yaxis1_title_text="Count",
    xaxis2_title_text="YearBuilt",
    yaxis2_title_text="Count",
    xaxis3_title_text="OverallQual",
    yaxis3_title_text="Count",
    hoverlabel_font_size=16,
    showlegend=False,
    height=400, 
    width=950
)
fig.show()

print("We can see that the LotArea data is skewed and has outliers. By removing them with a threshold of 25k, we can better identify trends with the other variables.")

# Remove outliers
df = df[df.LotArea < 25000]


We can see that the LotArea data is skewed and has outliers. By removing them with a threshold of 25k, we can better identify trends with the other variables.
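The effect of the cut-off can be checked numerically by comparing the skewness and the row count before and after filtering. The lot areas below are invented for illustration; the 25,000 threshold is the one used above.

```python
import pandas as pd

# Hypothetical lot areas: mostly typical lots plus two very large ones
toy = pd.DataFrame({
    "LotArea": [8000, 9000, 9500, 10000, 11000, 12000, 60000, 215000]
})

# Same filter as in the notebook
filtered = toy[toy.LotArea < 25000]

skew_before = toy["LotArea"].skew()
skew_after = filtered["LotArea"].skew()

print(f"rows: {len(toy)} -> {len(filtered)}")
print(f"skewness: {skew_before:.2f} -> {skew_after:.2f}")
```

A fixed threshold is easy to read off the histogram; a quantile-based cut (for example `toy["LotArea"].quantile(0.99)`) would be an alternative if the threshold should adapt to the data.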


#### 2.3.2 Plot the relationships

In [3]:
# Create subplots, showing relationship of quality, lot area, year built and price
fig = make_subplots(rows=3, cols=2)

fig.add_trace(
    go.Scatter(x=df["YearBuilt"], y = df["LotArea"],mode="markers"),
    row=1, col=1
)

fig.add_trace(
    go.Scatter(x=df["LotArea"], y = df["OverallQual"],mode="markers"),
    row=1, col=2
)

fig.add_trace(
    go.Scatter(x=df["LotArea"], y = df["SalePrice"],mode="markers"),
    row=2, col=1
)

fig.add_trace(
    go.Scatter(x=df["YearBuilt"],y = df["OverallQual"],mode="markers"),
    row=2, col=2
)

fig.add_trace(
    go.Scatter(x=df["YearBuilt"], y = df["SalePrice"],mode="markers"),
    row=3, col=1
)

fig.add_trace(
    go.Scatter(x=df["OverallQual"], y = df["SalePrice"],mode="markers"),
    row=3, col=2
)

fig.update_xaxes(title_font_size=16, tickfont_size=16)
fig.update_yaxes(title_font_size=16, tickfont_size=16)
fig.update_layout(
    title_text="Correlation of Quality, Lot Area, Year Built and Price",    
    xaxis1_title_text="YearBuilt",
    yaxis1_title_text="LotArea",
    xaxis2_title_text="LotArea",
    yaxis2_title_text="OverallQual",
    xaxis3_title_text="LotArea",
    yaxis3_title_text="SalePrice",
    xaxis4_title_text="YearBuilt",
    yaxis4_title_text="OverallQual",
    xaxis5_title_text="YearBuilt",
    yaxis5_title_text="SalePrice",
    xaxis6_title_text="OverallQual",
    yaxis6_title_text="SalePrice",
    hoverlabel_font_size=16,
    height=1200, 
    width=950   
)
fig.show()


print("Lot Area with Year Built: Lot area is almost evenly distributed for different years, no clear correlation.")

print("Quality with Lot Area: Quality is almost evenly distributed for different lot areas.")

print("Lot Area with Price: Price is slightly increasing for larger lot areas.")

print("Quality with Year Built: Quality shows an upward trend for newer properties.")

print("Year Built with Price: There is a clear trend visible, prices increase with newer properties.")

print("Quality with Price: There is a clear trend visible, prices increase with higher quality.")


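The visual impressions from the scatter plots can be cross-checked with a correlation matrix of the four variables. The sketch below runs on a small invented frame with the dataset's column names, so the numbers it prints say nothing about the real data; on the housing dataframe the same call would be `df[cols].corr()`.

```python
import pandas as pd

cols = ["OverallQual", "LotArea", "YearBuilt", "SalePrice"]

# Hypothetical data with the same column names as the housing dataset
toy = pd.DataFrame({
    "OverallQual": [4, 5, 6, 7, 8, 9],
    "LotArea": [9000, 8000, 10000, 8500, 9500, 9000],
    "YearBuilt": [1950, 1965, 1975, 1990, 2003, 2008],
    "SalePrice": [100000, 130000, 160000, 200000, 260000, 330000],
})

# Pairwise Pearson correlations between the four variables
corr = toy[cols].corr()
print(corr.round(2))
```

Pearson correlation only captures linear relationships, so it complements the scatter plots and does not replace them.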

### 2.4. How do quality, lot area, year built and price interact with each other and evolve over time?

In [122]:
# Sort dataframe by year sold so that the animation frames appear in chronological order
df = df.sort_values('YrSold')

fig1 = px.scatter(df, x="LotArea", y="YearBuilt", width=600, height=600, animation_frame="YrSold")
fig1.update_layout(xaxis_range=[0,50000])
fig1.show()

fig2 = px.scatter(df, x="LotArea", y="OverallQual", width=600, height=600, animation_frame="YrSold")
fig2.show()

fig3 = px.scatter(df, x="LotArea", y="SalePrice", width=600, height=600, animation_frame="YrSold")
fig3.update_layout(xaxis_range=[0,50000])
fig3.show()

fig4 = px.scatter(df, x="YearBuilt", y="OverallQual", width=600, height=600, animation_frame="YrSold")
fig4.show()

fig5 = px.scatter(df, x="YearBuilt", y="SalePrice", width=600, height=600, animation_frame="YrSold")
fig5.show()

fig6 = px.scatter(df, x="OverallQual", y="SalePrice", width=600, height=600, animation_frame="YrSold")
fig6.show()

print("The trend identified in 2.3.2. does not seem to change over the years where the properties were sold.")

The trend identified in 2.3.2. does not seem to change over the years where the properties were sold.
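One way to check whether a trend is stable over the sale years is to compute the same correlation separately for each value of `YrSold`. This is a sketch on invented rows with the dataset's column names, shown for the quality and price pair only:

```python
import pandas as pd

# Hypothetical sales from two years
toy = pd.DataFrame({
    "YrSold": [2006, 2006, 2006, 2007, 2007, 2007],
    "OverallQual": [5, 6, 7, 4, 6, 8],
    "SalePrice": [150000, 180000, 215000, 120000, 185000, 240000],
})

# Correlation between quality and price, computed per sale year
per_year = {
    int(year): group["OverallQual"].corr(group["SalePrice"])
    for year, group in toy.groupby("YrSold")
}

for year, r in per_year.items():
    print(f"{year}: r = {r:.3f}")
```

Similar values across the years would support the reading from the animations; a clear drift would suggest that the relationship changed over time.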
