# FinalCapstone_1_ClusteringNeighbourhoods

This notebook contains the analysis for the first part of the project "What is the best neighbourhood to live in as a student at Imperial College London?" (*link:* https://github.com/namiyousef/Coursera_Capstone).

**NOTE:** if you are a peer examiner from the IBM Coursera course (IBM Applied Data Science Capstone), please mark only this notebook and the relevant parts of the report. At present, I am not able to complete the second part of the project (stated in my report).

As such, this notebook only discusses the first part of my project: **Cluster London postcode districts to find similarities between them and, from that, the most 'appropriate' districts for students.**

The planned notebook structure is:
- Libraries needed 
- Data attainment
- Data exploration
- Data visualisation
- Data processing
- Modelling
- Evaluation
- Conclusion
- References

# 0 - Libraries needed, configuration

In [75]:
""" Libraries """

# file management and web scraping
import os
import urllib.request
try: 
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup




# mathematical

# data exploration
import pandas as pd
import numpy as np

# visualisation

# preprocessing

# modelling

# evaluation

""" Configuration """

# pandas
pd.set_option('display.max_columns', None)

# 1 - Data attainment

We draw on four data sources:
- **Population statistics:** https://data.london.gov.uk/dataset/office-national-statistics-ons-population-estimates-borough
- **Crime stats:** https://data.police.uk/data/fetch/159fe36a-b26d-4bb4-882a-803f490a7b2b/
- **Google API:** https://developers.google.com/maps/documentation/distance-matrix/overview
- **Rent prices:** https://www.rentbarometer.com/london/all-prices/by-postcode.html#BR

## Data attainment - Functions

In [48]:
def directory_to_df(paths, exclude = None, filetype = '.csv', ignore_index = True, exception = 'except'):
    """ concatenates all files in the given directories into a single dataframe
    components:
    paths: list of paths to the directories (each must end with /)
    exclude: list of filenames to exclude from the treatment
    filetype: a string of the file extension (must include .)
    ignore_index: boolean that tells pandas to ignore the index or not
    exception: takes a string. Any file whose name includes this string is additionally passed through
    special_treatment after being read
    returns: the concatenated dataframe (with a 'files' column holding the 1-based index of the source file)
    and the list of filenames with the extension removed
    """
    exclude = [] if exclude is None else exclude # avoid a mutable default argument
    filenames = []
    file_column = []
    frames = []
    test_index = 1

    for path in paths:
        for filename in os.listdir(path):
            if filetype in filename and filename not in exclude:
                curr_df = pd.read_csv(path+filename)
                if exception in filename:
                    curr_df = special_treatment(curr_df)
                frames.append(curr_df)
                filenames.append(filename.replace(filetype,''))
                # one entry per row, so the column lines up with the concatenated frame
                file_column.extend([test_index]*curr_df.shape[0])
                test_index+=1

    df = pd.concat(frames,ignore_index = ignore_index)
    df['files'] = file_column
    return df, filenames


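def special_treatment(df):
    """ performs a custom operation on a dataframe: removes the 'date' label, drops the 'gyrZ' column,
    reassigns the remaining labels in order, and moves the index into a new 'date' column
    components:
    df: dataframe to modify in place (must contain the columns 'date' and 'gyrZ')
    """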
    columns = df.columns.values.tolist()
    columns.remove('date')
    df.drop('gyrZ',inplace = True, axis = 1)
    df.columns = columns
    df.reset_index(inplace = True)
    df.rename(columns= {'index':'date'},inplace = True)
    return df
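As a usage sketch, the same idea (concatenate every CSV in a set of folders and tag each row with a 1-based file index) can be written more compactly with `DataFrame.assign`. The helper name `concat_csvs` and the temporary files below are assumptions made for illustration; they are not part of the project data, and the sketch leaves out the `exclude` and `exception` handling.

```python
import os
import tempfile

import pandas as pd


def concat_csvs(paths, filetype='.csv'):
    """Hypothetical compact variant of directory_to_df (no 'exclude' or 'exception' handling)."""
    frames, filenames = [], []
    for path in paths:
        for filename in sorted(os.listdir(path)):  # sorted for a reproducible file order
            if filename.endswith(filetype):
                frame = pd.read_csv(os.path.join(path, filename))
                # tag every row with the 1-based index of the file it came from
                frames.append(frame.assign(files=len(frames) + 1))
                filenames.append(filename[:-len(filetype)])
    return pd.concat(frames, ignore_index=True), filenames


# toy data: two small CSV files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'Month': ['2020-01', '2020-01']}).to_csv(os.path.join(tmp, 'a.csv'), index=False)
    pd.DataFrame({'Month': ['2020-02']}).to_csv(os.path.join(tmp, 'b.csv'), index=False)
    df_demo, names_demo = concat_csvs([tmp])
```

Using `os.path.join` removes the requirement that each path ends with `/`, and `endswith` avoids matching files that merely contain `.csv` somewhere in their name.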
    

### Population Statistics

In [49]:
path = ('/Users/yousefnami/Desktop/Yousef/PrivateTings/My Stuff/Courses'
        '/IBMDataScienceCertificate/CapstoneProject/Capstone/Coursera_Capstone/Data/PopStats.csv')

df_pop = pd.read_csv(path)

# explore the 'Unnamed' columns (uncomment the print to inspect their contents)

unnamed_columns = [column for column in df_pop.columns.tolist() if 'Unnamed' in column]
for column in unnamed_columns:
    #print('Column name: {}\n'.format(column),df_pop[column].value_counts())
    pass

# they seem to be, for the most part, empty. Let's delete them

df_pop.drop(unnamed_columns,axis = 1,inplace = True)
        
# other columns we don't need, or need relabelling

df_pop.drop('WD12CD',axis = 1, inplace = True)
df_pop.rename(columns={'WD12NM': 'PostDist','LAD12NM':'Borough'},inplace = True)
df_pop.head()

Unnamed: 0,Year,PostDist,Borough,all_ages (persons),m0,m1,m2,m3,m4,m5,m6,m7,m8,m9,m10,m11,m12,m13,m14,m15,m16,m17,m18,m19,m20,m21,m22,m23,m24,m25,m26,m27,m28,m29,m30,m31,m32,m33,m34,m35,m36,m37,m38,m39,m40,m41,m42,m43,m44,m45,m46,m47,m48,m49,m50,m51,m52,m53,m54,m55,m56,m57,m58,m59,m60,m61,m62,m63,m64,m65,m66,m67,m68,m69,m70,m71,m72,m73,m74,m75,m76,m77,m78,m79,m80,m81,m82,m83,m84,m85,m86,m87,m88,m89,m90plus,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f16,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30,f31,f32,f33,f34,f35,f36,f37,f38,f39,f40,f41,f42,f43,f44,f45,f46,f47,f48,f49,f50,f51,f52,f53,f54,f55,f56,f57,f58,f59,f60,f61,f62,f63,f64,f65,f66,f67,f68,f69,f70,f71,f72,f73,f74,f75,f76,f77,f78,f79,f80,f81,f82,f83,f84,f85,f86,f87,f88,f89,f90plus
0,2002,Aldersgate,City of London,1571,7,2,4,4,5,1,0,3,1,1,1,4,1,6,4,0,3,3,1,0,3,0,0,7,16,20,13,15,19,16,23,25,21,12,16,19,20,10,16,19,16,12,9,13,16,15,10,10,7,14,15,40,21,8,31,20,16,16,21,12,13,8,7,9,13,6,5,3,1,4,10,11,6,7,10,9,6,5,6,3,3,4,4,1,1,1,4,1,2,1,4,3,3,6,6,3,0,1,1,4,1,4,1,4,0,3,0,0,1,4,3,4,6,19,7,17,12,17,17,22,18,26,19,18,15,8,9,19,10,17,7,17,9,7,7,10,1,14,13,7,10,16,15,24,15,23,15,15,7,12,8,4,7,4,4,5,9,15,7,12,16,9,6,3,3,11,4,6,12,0,0,3,1,9,3,2,0,1,1,1,0,2
1,2003,Aldersgate,City of London,1578,7,4,5,3,4,3,0,3,3,0,1,1,4,1,4,3,0,2,1,0,0,1,4,5,14,17,20,16,13,19,15,25,24,16,15,13,16,18,10,16,23,15,9,10,14,14,19,10,9,6,17,15,38,20,10,25,22,16,18,19,16,9,10,8,7,8,7,6,5,3,4,11,9,7,4,12,8,3,6,3,4,3,5,3,1,0,1,4,1,0,3,2,5,5,8,4,3,4,0,0,1,1,4,1,4,0,3,0,0,1,4,1,8,10,20,15,20,13,15,16,17,18,19,24,20,22,6,10,13,7,16,10,15,12,6,9,11,5,13,17,9,8,14,19,25,19,19,16,9,7,13,6,3,7,4,4,11,12,13,10,13,6,8,4,4,3,9,2,3,8,2,3,3,1,7,5,0,0,0,0,3,2
2,2004,Aldersgate,City of London,1559,3,6,4,4,3,4,4,2,3,3,2,1,1,4,0,3,3,0,0,4,0,2,0,7,5,17,12,22,18,14,19,14,15,23,12,17,16,17,21,7,13,20,17,11,10,11,10,20,10,10,6,19,17,32,20,7,21,24,18,18,24,16,8,9,10,6,9,6,6,3,1,4,8,9,6,8,12,8,4,4,3,6,4,4,4,2,1,1,2,1,2,5,2,2,2,6,5,3,4,1,0,2,2,4,0,3,0,2,0,0,0,7,5,11,14,25,13,19,21,12,15,16,17,16,23,17,16,6,14,13,5,19,6,14,12,6,10,14,6,16,18,10,6,14,15,21,19,20,16,7,4,7,6,1,8,4,11,14,9,13,10,9,9,6,3,2,6,9,2,6,9,3,2,3,1,5,1,0,0,0,0,2
3,2005,Aldersgate,City of London,1461,4,2,5,3,3,4,4,1,2,4,3,1,1,3,1,0,4,4,0,1,2,1,3,0,10,13,15,16,21,13,16,17,13,12,19,12,23,16,18,15,10,10,17,15,11,8,10,10,13,9,12,6,17,19,25,19,7,22,21,20,23,19,15,7,10,7,8,6,4,3,5,1,5,7,9,7,7,10,8,3,6,3,6,4,3,3,1,2,0,0,3,7,5,2,1,1,5,1,2,3,0,0,2,2,1,1,3,0,1,0,0,5,2,9,9,15,14,12,14,19,11,13,15,22,15,20,18,6,8,7,9,5,16,2,14,8,7,11,15,3,13,15,10,7,17,16,21,20,20,15,5,2,6,6,2,11,9,10,13,8,13,5,7,9,6,4,2,7,8,2,5,10,3,1,3,4,1,0,0,0,0,3
4,2006,Aldersgate,City of London,1474,3,2,3,5,4,4,3,4,2,2,4,1,1,0,1,2,0,3,6,0,1,3,2,7,0,10,13,18,19,26,14,12,8,3,15,23,13,22,12,18,14,11,9,17,14,11,11,11,13,13,10,13,8,20,20,22,22,7,22,21,20,22,19,15,6,7,6,8,7,3,2,4,0,4,7,10,8,4,10,9,5,5,2,5,4,2,1,2,2,1,4,8,5,3,2,1,1,4,3,2,5,2,0,1,0,0,0,2,1,1,0,0,3,2,8,20,6,14,17,19,26,13,15,15,21,13,21,9,14,10,8,10,7,15,6,13,9,11,11,11,4,14,13,10,10,17,15,21,17,19,11,2,1,4,2,1,16,12,9,14,8,6,3,8,7,6,5,2,5,8,3,5,10,4,2,5,0,1,0,0,0,4
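The 'Unnamed' clean-up above can also be done without a loop, using a boolean mask over the column labels. This is a minimal sketch on a toy frame; the column names `Unnamed: 185` and `Unnamed: 186` are made up for illustration and may not match the labels in the real file.

```python
import pandas as pd

# toy frame mimicking the raw file: real columns plus empty 'Unnamed' ones
df_demo = pd.DataFrame({
    'Year': [2002, 2003],
    'WD12NM': ['Aldersgate', 'Aldersgate'],
    'Unnamed: 185': [None, None],  # hypothetical label
    'Unnamed: 186': [None, None],  # hypothetical label
})

# boolean mask over the column labels; keep everything that is not 'Unnamed'
unnamed = df_demo.columns.str.contains('Unnamed')
df_clean = df_demo.loc[:, ~unnamed]
```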


### Crime Stats

In [50]:
path = ('/Users/yousefnami/Desktop/Yousef/PrivateTings/My Stuff/Courses'
        '/IBMDataScienceCertificate/CapstoneProject/Capstone/Coursera_Capstone/Data/Crime Data')

# keep only the sub-directories (this skips macOS '.DS_Store' and any other stray files)
paths = [path+'/{}/'.format(item) for item in os.listdir(path)
         if os.path.isdir(os.path.join(path,item))]
df_temp,_ = directory_to_df(paths)

In [51]:
df_crime = df_temp
df_crime.drop(['Crime ID', 'Falls within','Reported by','LSOA code',
               'Last outcome category','Context'],axis = 1,inplace = True)
df_crime.head()

Unnamed: 0,Month,Longitude,Latitude,Location,LSOA name,Crime type,files
0,2020-01,-0.539301,50.8172,On or near Highdown Drive,Arun 009F,Other theft,1
1,2020-01,0.137065,51.583672,On or near Police Station,Barking and Dagenham 001A,Anti-social behaviour,1
2,2020-01,0.134947,51.588063,On or near Mead Grove,Barking and Dagenham 001A,Anti-social behaviour,1
3,2020-01,0.137065,51.583672,On or near Police Station,Barking and Dagenham 001A,Anti-social behaviour,1
4,2020-01,0.137065,51.583672,On or near Police Station,Barking and Dagenham 001A,Anti-social behaviour,1
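Note that the first row of the preview belongs to Arun, which is outside London, so the crime table will probably need filtering by area at some point. As a rough sketch of a first look, the rows can be counted per area and crime type. The `Area` column is a hypothetical name introduced here; it relies on each LSOA name ending in a space followed by an area code such as `001A`.

```python
import pandas as pd

# rows copied from the preview above (subset of columns)
df_demo = pd.DataFrame({
    'Month': ['2020-01'] * 5,
    'LSOA name': ['Arun 009F'] + ['Barking and Dagenham 001A'] * 4,
    'Crime type': ['Other theft'] + ['Anti-social behaviour'] * 4,
})

# drop the trailing code from the LSOA name to get the area it belongs to
df_demo['Area'] = df_demo['LSOA name'].str.rsplit(' ', n=1).str[0]

# number of recorded crimes per (area, crime type) pair
counts = df_demo.groupby(['Area', 'Crime type']).size()
```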


### Rent Prices

In [52]:
""" Note: this cell is not currently used by the rest of the notebook. The function still needs improvement """

def extract_html(parsed_html,find=('table',),attrs_dict=({},),remove_chars=('',)):
    """ extracts the text content of the matching HTML tags and returns it as nested lists
    components:
    parsed_html: this is html parsed using BeautifulSoup
    find: this tells the function what HTML tags to look for, by default: table
    Note that it is a list, so you can look for multiple things
    attrs_dict: list of attribute dictionaries, one per entry in find, i.e. find = ['table'], attrs_dict =
    [{'title':'mytitle'}] will find ALL tables with 'title' mytitle
    remove_chars: lines equal to any of these strings are dropped, defaults to '' (empty lines)
    returns: one list per entry in find, each holding one list of text lines per matching tag
    """
    html_components = []
    for attributes, html_type in zip(attrs_dict,find): # zip stops at the shorter of the two lists
        temp_components = []
        html_array = parsed_html.body.find_all(html_type,attrs = attributes)
        for html_instance in html_array:
            components = html_instance.text.split('\n')
            components = [item for item in components if item not in remove_chars]
            temp_components.append(components)
        html_components.append(temp_components)
    return html_components
# create a 'def htmlTable_to_df'

# ignore this cell for the time being, it requires further work

In [105]:
path = 'https://www.rentbarometer.com/london/all-prices/by-postcode.html#BR'

with urllib.request.urlopen(path) as response:
    html = response.read()

parsed_html = BeautifulSoup(html)
columns = ['Place','Studio','One Bed','Two Beds','Three Beds','Four Beds','Five Beds']
postcodes = []
data = [[] for column in columns]
html_components = parsed_html.body.find_all('table')
for component in html_components:
    component = component.text.split('\n') # split by line
    component = [item.replace('  ','') for item in component] # remove double spaces
    component = [item.replace(',','') for item in component] # remove commas
    component = list(filter(lambda a: a != '', component)) # drop empty entries
    del component[0:7] # drop the first seven entries (one header cell per column)
    for index,item in enumerate(component):
        if item == 'n/a':
            item = np.nan # missing price
        if isinstance(item,str):
            item = item.replace('£','')
        if index%7 == 0: # first cell of a row: place name followed by the postcode district
            split = item.split(' ')
            postcodes.append(split[-1])
            item = ' '.join(split[0:-1])
        data[index%7].append(item)

df_rent = pd.DataFrame(data={column:values for column,values in zip(columns,data)})
df_rent['PostDist'] = postcodes
df_rent.head()

Unnamed: 0,Place,Studio,One Bed,Two Beds,Three Beds,Four Beds,Five Beds,PostDist
0,Bromley,,263,287,381.0,498.0,,BR1
1,Croyon,197.0,280,355,,,,CR0
2,Bethnal Green,292.0,370,461,583.0,622.0,,E2
3,Bow,,313,371,481.0,612.0,637.0,E3
4,Canary Wharf,405.0,507,585,719.0,,,E14
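The `index%7` bookkeeping in the cell above amounts to cutting a flat list of cells into rows of seven. A minimal sketch of the same idea on a hand-written list follows. The `cells` list is a stand-in for the cleaned table text (its values are taken from the preview above), not the real scraped content, and the conversion to numbers is an extra step that the cell above does not perform.

```python
import pandas as pd

columns = ['Place', 'Studio', 'One Bed', 'Two Beds', 'Three Beds', 'Four Beds', 'Five Beds']

# hand-written stand-in for the cleaned cell list of two table rows
cells = ['Bromley BR1', 'n/a', '£263', '£287', '£381', '£498', 'n/a',
         'Bethnal Green E2', '£292', '£370', '£461', '£583', '£622', 'n/a']

# cut the flat list into rows of len(columns) cells each
rows = [cells[i:i + len(columns)] for i in range(0, len(cells), len(columns))]
df_demo = pd.DataFrame(rows, columns=columns)

# the last word of the first cell is the postcode district, the rest is the place name
split = df_demo['Place'].str.rsplit(' ', n=1, expand=True)
df_demo['Place'], df_demo['PostDist'] = split[0], split[1]

# strip the currency sign and convert to numbers; 'n/a' is coerced to NaN
for column in columns[1:]:
    df_demo[column] = pd.to_numeric(df_demo[column].str.replace('£', '', regex=False),
                                    errors='coerce')
```

Converting the prices to numbers at this stage means every price column has a numeric dtype, which is what later processing and modelling steps would likely need.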
