# Intro

This practice was carried out as a laboratory exercise for the *Data Science Lab: Process and Methods* course, which is taught in the [Data Science and Engineering](https://didattica.polito.it/laurea_magistrale/data_science/en/home) program at [Politecnico di Torino](https://www.polito.it/index.php?lang=it), during the 2022/2023 semester. The aim is to perform exploratory data analysis on the real-world *New York - Point of Interest* dataset ([download](https://github.com/dbdmg/data-science-lab/raw/master/datasets/NYC_POIs.zip)) using the [Pandas data analysis library](https://pandas.pydata.org/). The NumPy and Matplotlib libraries are also used.




# Datasets
We will work with 2 datasets:

### 1. **New York Point of Interest** dataset
This dataset is a sub-sample of the points of interest (POIs) located in the city of New York. Each row describes a POI with its **coordinates** and the **category** it belongs to. Each category has its own column, which holds the **type** of the POI. The fields in the dataset are:
1. _@id_: a unique id for each point of interest.
1. _@lat_: latitude coordinate of the POI in decimal degrees.
1. _@lon_: longitude coordinate of the POI in decimal degrees.
1. _amenity_: if the POI category is amenity, its type is reported in this field.
1. _name_: not used in this analysis.
1. _shop_: if the POI category is shop, its type is reported in this field.
1. _public_transport_: if the POI category is public transport, its type is reported in this field.
1. _highway_: if the POI category is highway, its type is reported in this field.


### 2. **The New York City municipality POIs** dataset
Some of the POIs belong to the NYC municipality. To identify them, a further file is provided. This file contains only the **ID**s that correspond to NYC municipality POIs.

*NOTE*: The map of the New York municipality is provided as well.

# Exercises and Solutions

**1.** Loading the datasets and keeping only the POIs that belong to the NYC municipality

In [None]:
#Loading libraries
import pandas as pd
import numpy as np
import warnings #used to suppress warnings in the output
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
#reading the csv file (sep='\t' is used since the files are tab-separated)
df = pd.read_csv("C:/Users/Koparan/Desktop/DataPolito/DataScienceLab/laboratory/solutions/lab5/pois_all_info",
                sep='\t')
df.head()

In [None]:
#exploring data fields with their data types in addition to null values
df.info()

In [None]:
#reading NYC municipality POIs file
#header=None because this file does not contain a header row;
#otherwise the first ID would be read as the header
ny_id = pd.read_csv("C:/Users/Koparan/Desktop/DataPolito/DataScienceLab/laboratory/solutions/lab5/ny_municipality_pois_id.csv",
                    sep='\t',
                    header=None)
ny_id.head()
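
As an alternative to renaming the column in a separate step, the column can be labelled while the file is read. The sketch below uses a small in-memory stand-in for the ID file; the three IDs and the variable names `raw_ids` and `sample_ids` are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-in for the municipality ID file: one ID per line, no header row.
raw_ids = io.StringIO("42432939\n42448838\n42723103\n")

# header=None keeps the first ID as data; names= labels the single column right away.
sample_ids = pd.read_csv(raw_ids, sep="\t", header=None, names=["@id"])

print(sample_ids.shape)
print(list(sample_ids.columns))
```

With the column already called `@id`, the merge on the ID column can be applied without a separate renaming step.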

In [None]:
#renaming the only column of the NYC municipality file to '@id', like the ID column in the main file
#(direct assignment is used because set_axis(..., inplace=True) is not available in pandas 2.x)
ny_id.columns = [df.columns[0]]
ny_id.info()

In [None]:
'''
- We can filter the municipality POIs by using the 'merge' function.
- The column we merge on must have the same name in both DataFrames.
- how='inner' is used because we only want to work with the IDs that
belong to both datasets.
'''
nydf = pd.merge(df,ny_id,on=df.columns[0],how='inner')
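
To make the effect of the inner merge concrete, here is a minimal sketch on toy data. The IDs and the frames `pois` and `municipality_ids` are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (made up): four POIs, two of which appear in the municipality list.
pois = pd.DataFrame({"@id": [1, 2, 3, 4],
                     "amenity": ["cafe", None, "bank", "school"]})
municipality_ids = pd.DataFrame({"@id": [2, 4, 99]})

# The inner merge keeps only the IDs that are present in both frames.
merged = pd.merge(pois, municipality_ids, on="@id", how="inner")

# A boolean filter gives the same rows, provided the municipality IDs are unique.
filtered = pois[pois["@id"].isin(municipality_ids["@id"])].reset_index(drop=True)

print(merged["@id"].tolist())    # [2, 4]
print(filtered["@id"].tolist())  # [2, 4]
```

ID 99 is dropped because it has no match in `pois`, and IDs 1 and 3 are dropped because they are not in the municipality list.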

**2.** Analysing the distribution of POI types for each POI category (*amenity*, *shop*, *public_transport*, *highway*)

In [None]:
#getting the categories as a list will make the following analysis easier
#first, we eliminate the columns whose names contain '@', e.g. '@lat'
categories = nydf[nydf.columns.drop(list(nydf.filter(regex='@')))]
#then, we eliminate the 'name' column, which is not useful for our study
categories = categories.drop(columns=['name'])
categories = list(categories.columns)
categories

In [None]:
#create a dict that holds one DataFrame per category, keyed by 'df_' + category name
dfs = {}
#fill the dict: each DataFrame keeps the '@type' column and the category column
for cat in categories:
    dfs['df_'+cat] = nydf.loc[:,['@type',cat]]
#drop the rows where the category column is NaN, i.e. the POIs that do not belong to that category
for k in dfs:
    dfs[k].dropna(subset=[dfs[k].columns[1]],inplace=True)

In [None]:
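#to visualize effectively, keep only the most frequent POI types, up to and including the first one
#whose cumulative share exceeds perc_value (80% by default). We will use this function for every category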
def get_top_perc(series, perc_value=0.8):
    perc = series.cumsum() / series.sum()
    arg = (perc>perc_value).values.argmax()
    return series.iloc[:arg+1]
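
A short worked example may help to see where the cut-off falls. The function is repeated here so that the block runs on its own; the counts are made up and are sorted in descending order, as `value_counts()` would return them.

```python
import pandas as pd

def get_top_perc(series, perc_value=0.8):
    # Cumulative share of each entry, then the position of the first share above the threshold.
    perc = series.cumsum() / series.sum()
    arg = (perc > perc_value).values.argmax()
    return series.iloc[:arg + 1]

# Made-up counts of POI types, most frequent first.
counts = pd.Series({"restaurant": 50, "cafe": 30, "bank": 15, "school": 5})

top = get_top_perc(counts, 0.8)
# Cumulative shares are 0.50, 0.80, 0.95, 1.00; the first one strictly above 0.8 is "bank".
print(top.index.tolist())  # ['restaurant', 'cafe', 'bank']
```

Because the comparison is strict, a type that lands exactly on the threshold does not end the selection; the next type is included as well.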

In [None]:
#create one figure per category, so we will have 4 bar charts
for col in categories:
    p = .8 #threshold value
    valc = dfs['df_'+col].iloc[:,1].value_counts()
    valf = get_top_perc(valc,p)
    fig,ax = plt.subplots() #create a figure
    valf.plot(kind='bar',ax=ax) #bar chart of the type frequencies
    ax.set_xticklabels(ax.get_xticklabels(), rotation=90) #rotate the x axis names as vertical
    fig.suptitle(f"Top {p*100:.0f}% points in the category: {col}") #title for selected POI in the loop

**3.** Show the POIs on the New York map with a scatter plot. For each POI type, use a different colour.

In [None]:
#location and types
dfs = {}
for cat in categories:
    dfs['df_'+cat] = nydf.loc[:,['@lat','@lon',cat]]
#delete NaN values for each type
for k in dfs:
    dfs[k].dropna(subset=[dfs[k].columns[2]],inplace=True)

In [None]:
import seaborn as sns

class Map:
    def __init__(self,df):
        #this dataframe will be used for the map. We store it on the class object
        self.pois_df = df
        #to scale the map, we get max and min values for vertical and horizontal axes.
        self.lat_min = df['@lat'].min()
        self.lat_max = df['@lat'].max()
        self.long_min = df['@lon'].min()
        self.long_max = df['@lon'].max()
    def plot_map(self):
        """Display the image with NY map and return the Axes object"""
        fig, ax = plt.subplots()
        nyc_img = plt.imread("C:/Users/Koparan/Desktop/DataPolito/DataScienceLab/laboratory/solutions/lab5/New_York_City_Map.PNG")
        ax.imshow(nyc_img, zorder=0, extent = [self.long_min,
                                               self.long_max,
                                               self.lat_min,
                                               self.lat_max])
        ax.grid(False)
        return ax
    def plot_pois(self,ax,category,mask):
        """Plot data on specified Axis"""
        df = self.pois_df.loc[mask]
        types = df[category].unique()
        #plt.get_cmap is used because matplotlib.cm.get_cmap was removed in Matplotlib 3.9
        cmap = plt.get_cmap('viridis')
        colors = cmap(np.linspace(0,1,types.size))
        for i,t in enumerate(types):
            df_t = df.loc[df[category]==t]
            c = [colors[i]]*df_t.shape[0]
            df_t.plot.scatter(x='@lon',y='@lat',
                             ax = ax,
                             c = c,
                             alpha=.6,
                             label=t)
        ax.legend()
        ax.grid(False)
        return ax

In [None]:
#the function works for any POI category ('column').
def show_category_on_map(df, column, perc_value):
    counts = df[column].value_counts()
    top_freq = get_top_perc(counts, perc_value)
    np_map=Map(df)
    
    ax = np_map.plot_map()
    mask = df[column].isin(top_freq.index) #this mask keeps only the rows whose type is among the most frequent ones selected by get_top_perc with perc_value
    np_map.plot_pois(ax,column,mask)

#you can use any POI category here. I chose amenity to illustrate
show_category_on_map(dfs['df_amenity'],'amenity',.5)

**4.** Divide the New York map into a grid and number its cells. Then, assign each POI to the cell it belongs to, based on its coordinates.

In [None]:
class Assign():
    def __init__(self, df, dfs):
        """same as the Map class above"""
        self.df = df
        self.min_lon = df.loc[:,'@lon'].min()
        self.max_lon = df.loc[:,'@lon'].max()
        self.min_lat = df.loc[:,'@lat'].min()
        self.max_lat = df.loc[:,'@lat'].max()
    def gridMap(self):
        """same as the Map class above, except for ax.grid(True)"""
        fig, ax = plt.subplots()
        nymap = plt.imread("C:/Users/Koparan/Desktop/DataPolito/DataScienceLab/laboratory/solutions/lab5/New_York_City_Map.PNG")
        plt.imshow(nymap, zorder=0, extent = [self.min_lon,
                                              self.max_lon,
                                              self.min_lat,
                                              self.max_lat])
        ax.grid(True)
        return ax
    def getGridLoc(self):
        """getting the locations of grid intersections as x and y axes"""
        ax = self.gridMap()
        self.xGridLocs = list(ax.get_xticks())
        self.yGridLocs = list(ax.get_yticks())
        return self.xGridLocs, self.yGridLocs
    def zoneByLoc(self):
        """create a dataframe in which columns are X ticks while rows are Y ticks of the map.
        values are just numbers that show which ticks correspond to which zone number"""
        self.getGridLoc()
        totalZone = (len(self.xGridLocs)-1)*(len(self.yGridLocs)-1)
        s = np.array(range(1,totalZone+1)).reshape(len(self.yGridLocs)-1,
                                               len(self.xGridLocs)-1)
        #each zone is labelled by its upper x and y ticks: a point falls in the zone of the first ticks that are greater than its coordinates
        self.zones = pd.DataFrame(data=s,
                              index=self.yGridLocs[1:],
                              columns=self.xGridLocs[1:])
        return self.zones
    def getNYid(self,nyids):
        """This is used to eliminate the rows that do not belong to New York from the dataframe"""
        nyids.columns = [self.df.columns[0]] #set_axis(..., inplace=True) is not available in pandas 2.x
        self.nydf = pd.merge(self.df,nyids, on=self.df.columns[0],how='inner')
        self.nydf.dropna(axis=0, how='all', subset=categories, inplace=True)
        return self.nydf
    def assignZone(self):
        """We assign each row in the dataframe to a cell in the grid map.
        First, we create a column 'cell_id' in the dataframe to assign related number"""
        self.nydf['cell_id'] = 0
        zones = self.zoneByLoc()
        lats = zones.index 
        lons = zones.columns
        for i in self.nydf.index:
            lat = self.nydf.loc[i,'@lat']
            lon = self.nydf.loc[i,'@lon']
            for a in lats:
                for b in lons:
                    if lat<a and lon<b:
                        self.nydf.loc[i,'cell_id']=zones.loc[a,b]
                        break
                else: #this 'else' block is executed if the 'for' loop is not terminated with 'break' statement
                    continue
                break 
        return self.nydf
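
The nested loops in `assignZone` can be slow on large frames. As a possible alternative (not the approach used in this notebook), the same "first tick greater than the coordinate" rule can be sketched with `np.searchsorted`. The tick arrays `x_ticks` and `y_ticks` below are hypothetical stand-ins for `ax.get_xticks()` and `ax.get_yticks()`, and the points are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical tick positions standing in for ax.get_xticks() / ax.get_yticks().
x_ticks = np.array([-74.3, -74.1, -73.9, -73.7])
y_ticks = np.array([40.5, 40.7, 40.9])

# Made-up points; the last one lies above the highest latitude tick.
points = pd.DataFrame({"@lat": [40.55, 40.75, 40.85, 41.00],
                       "@lon": [-74.20, -74.00, -73.80, -74.00]})

n_cols = len(x_ticks) - 1
n_rows = len(y_ticks) - 1

# Count how many upper cell edges are <= each coordinate.
# side="right" mirrors the strict "<" comparison used in assignZone.
col = np.searchsorted(x_ticks[1:], points["@lon"].to_numpy(), side="right")
row = np.searchsorted(y_ticks[1:], points["@lat"].to_numpy(), side="right")

# Cells are numbered row by row starting from 1; points beyond the last tick get 0.
inside = (col < n_cols) & (row < n_rows)
points["cell_id"] = np.where(inside, row * n_cols + col + 1, 0)

print(points["cell_id"].tolist())  # [1, 5, 6, 0]
```

The numbering follows the layout built in `zoneByLoc`: latitude ticks index the rows, longitude ticks index the columns, and a `cell_id` of 0 marks a point that was not matched to any cell.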
        
        

In [None]:
#create an object of the Assign class
c = Assign(nydf, dfs)
#execute the needed methods of the class to obtain a DataFrame that also contains the cell_id of each row
c.zoneByLoc()
c.getNYid(ny_id)
celldf = c.assignZone()
celldf

**5.** Identify how many times a POI type is contained in each cell, for each category.

In [None]:
#the cell_id values, which identify the grid cells, will be the index values
ind = sorted(celldf.cell_id.unique())
#we will create a pivot table in which the grid cells are the index and the POI categories are the columns
pivot = pd.DataFrame(index = ind)
for i in categories:
    cellpoi = celldf[[i,'cell_id']]
    cellpoi.dropna(subset=[i],inplace=True)
    a = cellpoi['cell_id'].value_counts()
    #the assignment aligns on the index, so each count lands in the row of its cell_id
    pivot[i] = a
print(pivot)
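
The per-cell counts can also be obtained in one step with `groupby` and `count`. This is a sketch on a made-up frame shaped like `celldf`, where a category column is NaN when the POI does not belong to that category; the frame `toy` and its values are invented for illustration.

```python
import pandas as pd

# Made-up frame in the same shape as celldf: one row per POI.
toy = pd.DataFrame({
    "cell_id": [1, 1, 2, 2, 2],
    "amenity": ["cafe", None, "bank", "cafe", None],
    "shop":    [None, "bakery", None, None, "kiosk"],
})

# count() ignores missing values, so each entry is the number of POIs
# of that category in that cell.
per_cell = toy.groupby("cell_id")[["amenity", "shop"]].count()
print(per_cell)
```

One difference from the loop above: a cell with no POIs in a category shows 0 here, whereas the loop leaves NaN in that position.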
    

**6.** Examine the correlation between the POI types of the 'amenity' and 'shop' categories, based on cells

In [None]:
#this function creates a pivot table in which 'cell_id' is the index and the POI types of the selected category are the columns
def getCount4Zone(df, column, perc=.6):
    counts = df[column].value_counts()
    percentage = counts.cumsum() / counts.sum()
    arg = (percentage>perc).argmax()
    names = counts.index[:arg+1]
    mask = df[column].isin(names)
    df2 = df.loc[mask]
    pivot = df2.pivot_table(values='@lat',
                        index='cell_id',
                        columns=column,
                        aggfunc = 'count',
                        fill_value = 0)
                        
    return pivot

In [None]:
#create 2 different pivot tables for 'amenity' and 'shop'
amedf = getCount4Zone(celldf,'amenity')
shopdf = getCount4Zone(celldf,'shop')
#concatenate these pivot tables into one
final_df = pd.concat([amedf,shopdf], axis=1)
final_df

In [None]:
#calculate the correlation between columns
final_corr = final_df.corr()
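
`DataFrame.corr()` computes the Pearson coefficient for every pair of columns by default. The sketch below checks this on made-up per-cell counts by recomputing one coefficient by hand; the frame `counts` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up per-cell counts: rows are cells, columns are POI types.
counts = pd.DataFrame({"cafe": [1, 2, 3, 4],
                       "bakery": [2, 4, 6, 8],
                       "bank": [4, 3, 2, 1]},
                      index=pd.Index([1, 2, 3, 4], name="cell_id"))

corr = counts.corr()  # Pearson correlation by default

# Same coefficient computed by hand for one pair of columns:
# r = sum((x - mean_x) * (y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
x = counts["cafe"] - counts["cafe"].mean()
y = counts["bank"] - counts["bank"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(corr.loc["cafe", "bank"], r_manual)
```

In this toy frame "bakery" grows linearly with "cafe" and "bank" falls linearly, so the coefficients are +1 and -1 up to floating-point error. Real per-cell counts will give values in between.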

In [None]:
#plotting heatmap of correlation
fig, ax = plt.subplots()
im = ax.imshow(final_corr)
ax.set_xticks(np.arange(final_corr.columns.size))
ax.set_yticks(np.arange(final_corr.columns.size))
ax.set_xticklabels(final_corr.columns)
ax.set_yticklabels(final_corr.columns)
plt.setp(ax.get_xticklabels(),rotation=90, ha = 'right', va='center',
        rotation_mode='anchor')
cbar = ax.figure.colorbar(im, ax=ax)
_ = cbar.ax.set_ylabel('pearson correlation', rotation=-90, va='bottom')

# -The End- #