# Final Project: A prosperous but noisy city: Noise analysis in New York

# Table of Contents

# Motivation

## What is your dataset?

The dataset we used for the project is the record of 311 Service Requests, published on the NYC Open Data website. 311 is a government hotline, and its requests reflect the everyday problems of many residents. [Link](https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9) <br>The dataset contains the hotline's records from 2010 to the present, about a decade, covering all aspects of residents' daily life.

<br>
New Yorkers can file complaints through NYC's online customer service, text messages, phone calls, Skype, etc. The NYC 311 dataset covers all aspects of citizens' life in New York, which can be roughly divided into the following categories: Benefit & Support, Business & Consumers, Courts & Law, Culture & Recreation, Education, Employment, Environment, Garbage & Recycling, Government & Elections, Health, Housing & Buildings, Noise, Pets, Pests & Wildlife, Public Safety, Records, Sidewalks, Streets & Highways, Taxes, Transportation.
<br>

NYC311's mission is to provide the public with fast, convenient access to city government services and information, while providing the best customer service. It also helps organizations improve the services they offer, allowing them to focus on their core tasks and manage their workloads effectively. Meanwhile, NYC311 also provides insights into improving city government through accurate and consistent measurement and analysis of service delivery.

<br>
Moreover, NYC311 is available 24 hours a day, 7 days a week, 365 days a year.
Not only does NYC311 offer an online translation service in more than 50 languages, but users can also call the 311 hotline in more than 175 languages if their language is not included. In addition, people who are deaf, hard of hearing or have a speech impairment can also file complaints with special help such as the video relay service (VRS).

<br>We believe there is a lot of information to explore in such a large and data-rich dataset.

## Why did you choose this particular dataset?

However, it was impossible for us to conduct a comprehensive analysis of this incredibly huge dataset, so after computing some preliminary statistics, we chose the category with the most cumulative complaints over the past decade: Noise.

<br>First of all, when it comes to environmental pollution, people may first think of air, soil and water, but noise pollution, although invisible and intangible, has an impact on us that cannot be ignored either. As a serious "urban disease", noise pollution has increasingly become a focus of modern urban life. New York, as a prosperous international city, also has this problem.

<br>Moreover, we want to study the noise complaints in New York and analyze them from both a spatial and a temporal perspective. Through the noise complaints, we hope to learn something about the urban conditions, economic development, residents' living conditions, traffic conditions, etc. in the five boroughs of New York. We also wonder whether noise complaints can be used to describe the overall development and condition of New York City over a 10-year period.

##  What was your goal for the end user's experience?

To begin with, we want to share interesting insights from the noise analysis with the readers. The seemingly boring government complaints hotline actually contains many interesting insights, which not only reflect people's life in New York, but also provide some directions and suggestions for the government to improve city services.
Also, through the analysis of noise complaints in NYC, we hope users can understand the character, living habits, preferences and cultural backgrounds of the residents in the five boroughs of New York.
<br>

Furthermore, we hope that, while reading the New York stories we present, readers can freely access the information they find useful through the interactive map and the interactive bar chart. This not only increases readers' understanding but also makes reading more participatory and interesting.

# Basic stats

##  Overview of the dataset

In [1]:
import pandas as pd
import numpy as np
df_origin=pd.read_csv('311-2019-all.csv')
df=pd.read_csv('311-All-Concise-with-IncidentZip.csv')


The dataset has 22.8M rows and 41 columns with size of 12GB. The dataset is shown as follows.

In [None]:
df_origin.head(10)

The attributes are shown as follows:

In [None]:
df_origin.columns

We made a bar chart of the 15 most frequent complaint types in New York during 2010~2020 to get some inspiration.

In [None]:
import matplotlib.pyplot as plt
# Count the complaints per type, most frequent first
complaint_count=df['Complaint Type'].value_counts()

title='The 15 most frequent complaint types in New York during 2010~2020'
to_display=complaint_count.iloc[0:15]
f,p=plt.subplots(figsize=(20,15))
p.bar(to_display.index,to_display.values)
p.tick_params(axis='x',labelrotation=90)
p.tick_params(labelsize=15)
p.set_title(title,fontsize=20)

From the figure, we found that noise is the most reported complaint type, which inspired us to find out more about it. For the temporal and spatial analysis of noise, we consider only 9 attributes relevant, and only these are retained.

In [None]:
df.columns

These attributes are used for different purposes.
* Created Date/Closed Date: Used to label the time of each case; they serve the temporal analysis. They are stored as strings.
* Complaint Type: The main complaint type. It has 439 different values and provides a fundamental classification of each complaint.
* Descriptor: The names of some main types are ambiguous and may confuse people. The Descriptor is associated with the Complaint Type and provides further detail on the incident or condition. It can be seen as a set of subtypes of each Complaint Type. It has 1168 different values.
* Location Type: Describes the type of location used in the address information. It corresponds to 'Complaint Type' as well as 'Descriptor', so it can provide further explanation. For example, the location type Store corresponds to the complaint type Noise - Commercial. It helps when the Complaint Type and Descriptor are ambiguous.
* Incident Zip: Zip code of the block where the incident took place. It contains some irrelevant information and NaN values; how we handle them is explained in the 'Data preprocessing and cleaning' section.
* Borough: Name of the borough where the incident took place. It contains some irrelevant information and NaN values; how we handle them is explained in the 'Data preprocessing and cleaning' section.
* Latitude/Longitude: Coordinates of the incident position.
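Because the full export is around 12 GB, one possible way to keep only these attributes is to select the columns while reading and to process the file in chunks. The sketch below illustrates the idea under stated assumptions: the helper name `load_relevant_columns` is hypothetical, and the inline sample text stands in for the real CSV file.

```python
import io
import pandas as pd

# The nine retained attributes listed above.
RELEVANT_COLUMNS = [
    'Created Date', 'Closed Date', 'Complaint Type', 'Descriptor',
    'Location Type', 'Incident Zip', 'Borough', 'Latitude', 'Longitude',
]

def load_relevant_columns(source, chunksize=100_000):
    """Read a CSV chunk by chunk, keeping only the relevant columns (hypothetical helper)."""
    chunks = pd.read_csv(source, usecols=RELEVANT_COLUMNS,
                         dtype={'Incident Zip': str}, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)

# Tiny made-up sample standing in for the real file.
sample_csv = io.StringIO(
    'Unique Key,Created Date,Closed Date,Complaint Type,Descriptor,'
    'Location Type,Incident Zip,Borough,Latitude,Longitude\n'
    '1,01/15/2019 03:30:00 PM,01/15/2019 05:00:00 PM,Noise - Residential,'
    'Loud Music/Party,Residential Building/House,10001,MANHATTAN,40.75,-73.99\n'
)
sample = load_relevant_columns(sample_csv, chunksize=1)
print(sample.columns.tolist())
```

Reading 'Incident Zip' as a string keeps malformed entries visible for the later cleaning step instead of coercing them to numbers.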

##  Data preprocessing and cleaning 

### Datetime

Firstly, we adopt Created Date as the time when the incident happened. It has to be transformed into pandas datetime objects so that we can extract the information.

In [None]:
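# The timestamps carry an AM/PM marker, so the hour must be parsed with %I;
# with %H the %p marker has no effect on the parsed hour
suitform='%m/%d/%Y %I:%M:%S %p'
df['TransCDatetime']=pd.to_datetime(df['Created Date'],format=suitform)
# Running month index: January 2010 is month 1
df['month']=df['TransCDatetime'].dt.month+(df['TransCDatetime'].dt.year-2010)*12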

In [None]:
time_nan=df['TransCDatetime'].isna()
print('The percentage of NaN values for the created time is {:10.2f}%'.format(time_nan.sum()/df.shape[0]*100))

We successfully transformed the datetime format, which indicates that all the elements are valid. No NaN value was detected in the attribute either.
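As a small illustration of the conversion, the sketch below parses two made-up timestamps and derives the running month index. It assumes that 'Created Date' uses a 12-hour clock with an AM/PM marker; in that case the hour has to be read with `%I`, because in `strptime`-style parsing `%p` only affects the hour when `%I` is used.

```python
import pandas as pd

# Made-up timestamps in the assumed "MM/DD/YYYY HH:MM:SS AM/PM" layout.
created = pd.Series(['01/15/2019 03:30:00 PM', '12/31/2010 12:05:00 AM'])

# %I reads the 12-hour clock so that the %p marker is taken into account.
parsed = pd.to_datetime(created, format='%m/%d/%Y %I:%M:%S %p')

# Running month index: January 2010 is month 1.
month_index = parsed.dt.month + (parsed.dt.year - 2010) * 12

print(parsed.dt.hour.tolist())   # hours on the 24-hour clock
print(month_index.tolist())
```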

### Complaint type and Descriptor

For the noise analysis, we use the following five main types. We only focus on the noise types that are among the 50 most frequent complaint types.

In [None]:
complaint_count=df['Complaint Type'].value_counts()
TOP_COMPLAINTS=50
cared=complaint_count.iloc[0:TOP_COMPLAINTS].index
Noise_type=[]
for i in cared:
    if 'oise' in i:
        Noise_type.append(i)
Noise_type

In each main type, we also have subtypes which are shown below.

In [None]:
Noise_summary=dict()
for i in Noise_type:
    temp=df[df['Complaint Type']==i]
    Noise_summary[i]=temp

for i in Noise_type:
    print('The main type is', i)
    subtype=Noise_summary[i]['Descriptor'].unique()
    for j in subtype:
        print('    The subtype is',j)

In summary, we have 5 main types and 36 subtypes. We consider all of them valid, so no further cleaning or processing is needed.
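The number of subtypes per main type can also be obtained in a single grouped operation. The following is a minimal sketch on made-up records, not a replacement for the cells above; the sample values are purely illustrative.

```python
import pandas as pd

# Made-up records; the real values come from the 311 dataset.
records = pd.DataFrame({
    'Complaint Type': ['Noise - Residential', 'Noise - Residential',
                       'Noise - Commercial', 'HEAT/HOT WATER',
                       'Noise - Residential'],
    'Descriptor': ['Loud Music/Party', 'Banging/Pounding',
                   'Loud Music/Party', 'ENTIRE BUILDING',
                   'Loud Music/Party'],
})

# Keep the noise complaints, then count the distinct descriptors per main type.
is_noise = records['Complaint Type'].str.contains('oise', regex=False)
subtypes_per_type = (records[is_noise]
                     .groupby('Complaint Type')['Descriptor']
                     .nunique())
print(subtypes_per_type)
```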

### Cleaning Incident Zip and Coordinates

We created a choropleth map of the distribution of noise cases across different blocks in 2019, by counting the number of cases for each zipcode.

In the first place, the data quality for the ten years (2010~2020) is analyzed.

In [None]:
df['Incident Zip'].unique()

Two main problems with the attribute Incident Zip have been detected:
* NaN values
* Zipcodes with invalid characters, e.g. letters

It is necessary to figure out the percentage of invalid values. It is calculated as follows.

In [None]:
# Check each item for the following problems: NaN, invalid characters
import re
zipnan=df['Incident Zip'].isna()
zipnan=zipnan.to_numpy()
zipalph=[]
for i in df['Incident Zip']:
    a=(re.search('[a-zA-Z]', str(i)) is not None)  # contains a letter
    b=(re.search('[-]', str(i)) is not None)       # contains a hyphen
    # An item is flagged only when it contains both a letter and a hyphen
    zipalph.append(a and b)
zipalph=np.array(zipalph)
# Number of invalid items; converted to a percentage when printing
percentage=zipalph.sum()+zipnan.sum()
print('The percentage of invalid value of the whole dataset is {:10.2f}%'.format(percentage/df.shape[0]*100))

The percentage of invalid values is 5.79%, which is acceptable because we mainly focus on the overall distribution and trend of some focused features.
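For comparison, a vectorized validity check could look like the sketch below. It applies a stricter rule than the cell above, which is an assumption made here for illustration: a zipcode counts as valid only if it consists of exactly five digits, so ZIP+4 values and text placeholders are treated as invalid. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample covering the problem cases: NaN, text, and ZIP+4.
zips = pd.Series(['10001', '11226', np.nan, 'N/A', '10001-1234', 'UNKNOWN'])

# Assumed rule: a valid zipcode is exactly five digits; NaN counts as invalid.
is_valid = zips.str.fullmatch(r'\d{5}', na=False)

invalid_share = (~is_valid).mean() * 100
print('Invalid share: {:.2f}%'.format(invalid_share))
```

If the column was read as a numeric type, it would first have to be converted to strings for this check to apply.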

However, the interactive map presents the noise distribution in 2019, so particular attention should be paid to the data quality for this year.

In [None]:
df['year']=df['TransCDatetime'].dt.year
df_2019=df[df['year']==2019]

In [None]:
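import re
zipnan1=df_2019['Incident Zip'].isna()
zipnan1=zipnan1.to_numpy()
zipalph1=[]
for i in df_2019['Incident Zip']:
    a=(re.search('[a-zA-Z]', str(i)) is not None)  # contains a letter
    b=(re.search('[-]', str(i)) is not None)       # contains a hyphen
    zipalph1.append(a and b)
# Use the 2019 flags (zipalph1), not the flags of the whole dataset (zipalph)
zipalph1=np.array(zipalph1)
# Number of invalid items in 2019; converted to a percentage when printing
percentage=zipalph1.sum()+zipnan1.sum()
print('The percentage of invalid value for 2019 is {:10.2f}%'.format(percentage/df_2019.shape[0]*100))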

The 2019 data is of better quality than the dataset as a whole (3.16% invalid values in 2019 compared to 5.79% for 2010~2020), which indicates an improvement in data collection by the government.

We still wanted to correct the invalid values for 2019. K-nearest neighbours (KNN) is a machine learning algorithm that could be adopted for this problem, because the zipcode is determined by the coordinates of the point. Since the zipcode would have to be predicted from the coordinates, the first thing we checked was the probability that the coordinates are invalid given that the zipcode is invalid.
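To make the idea concrete, the sketch below shows how a zipcode could be predicted from coordinates with a plain NumPy nearest-neighbour vote. It is an illustration only: the function `knn_predict_zip`, the choice `k=3`, the use of squared Euclidean distance on raw latitude/longitude, and all sample points are assumptions.

```python
from collections import Counter

import numpy as np

def knn_predict_zip(train_coords, train_zips, query_coords, k=3):
    """Predict a zipcode for each query point by majority vote
    among its k nearest labelled points (hypothetical helper)."""
    train_coords = np.asarray(train_coords, dtype=float)
    query_coords = np.asarray(query_coords, dtype=float)
    predictions = []
    for point in query_coords:
        # Squared Euclidean distance to every labelled point.
        dist = np.sum((train_coords - point) ** 2, axis=1)
        nearest = np.argsort(dist)[:k]
        votes = Counter(train_zips[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Made-up labelled points forming two small clusters.
train_coords = [(40.750, -73.990), (40.751, -73.991), (40.749, -73.989),
                (40.650, -73.950), (40.651, -73.951), (40.649, -73.949)]
train_zips = ['10001', '10001', '10001', '11226', '11226', '11226']

# Points whose zipcode is missing but whose coordinates are known.
query_coords = [(40.7505, -73.9905), (40.6505, -73.9505)]
predicted = knn_predict_zip(train_coords, train_zips, query_coords, k=3)
print(predicted)
```

This approach only works for records that still have coordinates, which is why the share of invalid zipcodes with missing coordinates matters.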

Here, we first count the missing coordinates and then look for outliers in the coordinates with boxplots.

In [None]:
a=df_2019['Latitude'].isna() & df_2019['Longitude'].isna()
b=df_2019['Latitude'].isna()
print('Total number of rows with NaN in both Latitude and Longitude is {}'.format(a.sum()))
print('Total number of rows with NaN in Latitude is {}'.format(b.sum()))

The two numbers are equal, which means that whenever Latitude is NaN, the corresponding Longitude is NaN as well.

In [None]:
f,p=plt.subplots(1,2,sharex=True,figsize=(20,5))
font=18
#titledict={'x':0.02,'y':0.9}
p[0].set_title('Latitude of noise cases',fontsize=font)
p[0].boxplot(df_2019[~b]['Latitude'])
p[0].tick_params(labelsize=font)
p[1].set_title('Longitude of noise cases',fontsize=font)
p[1].boxplot(df_2019[~b]['Longitude'])
p[1].tick_params(labelsize=font)

After removing the NaN values, all the coordinates are in the right range. We consider that no other outliers are included.

In [None]:
latnan1=b
latnan1=latnan1.to_numpy()
print('The percentage of invalid value of coordinates for 2019 is {:10.2f}%'.format(latnan1.sum()/df_2019.shape[0]*100))

The percentage of invalid values is 5.31%. Next, we calculate the probability of an invalid coordinate given an invalid zipcode.

In [None]:
notused=0
for i in range(df_2019['Incident Zip'].shape[0]):
    if latnan1[i] and zipnan1[i] and ~zipalph1[i]:
        notused+=1
print('The percentage of invalid coordinates given an invalid zipcode is {:10.2f}%'.format(notused/percentage*100))

This means that a record with an invalid zipcode is 99.83% likely to be missing its coordinates as well. KNN would therefore not be effective, and we also infer that when the government did not record the zipcode, it did not record the position of the case either.

Based on the above analysis, we discarded the invalid zipcode values; this will not have a great effect on the analysis result.
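The conditional share used in this argument is the number of records with both an invalid zipcode and missing coordinates, divided by the number of records with an invalid zipcode. A minimal vectorized sketch on made-up data follows; for simplicity it treats only NaN zipcodes as invalid, which is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up sample: three records lack a zipcode, two of them also lack coordinates.
sample = pd.DataFrame({
    'Incident Zip': ['10001', np.nan, np.nan, '11226', np.nan],
    'Latitude': [40.75, np.nan, np.nan, 40.65, 40.70],
    'Longitude': [-73.99, np.nan, np.nan, -73.95, -73.90],
})

zip_invalid = sample['Incident Zip'].isna()
coord_missing = sample['Latitude'].isna() | sample['Longitude'].isna()

# Share of invalid-zipcode records that are also missing their coordinates.
p_missing_given_invalid = (zip_invalid & coord_missing).sum() / zip_invalid.sum()
print('{:.2f}%'.format(p_missing_given_invalid * 100))
```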

### Borough

We created an interactive bar chart displaying the distributions of various noise types in the different boroughs.

In the first place, the data quality for the ten years (2010~2020) is analyzed.

In [None]:
df['Borough'].unique()

It is shown that the invalid value is 'Unspecified', for which we have calculated its percentage in the whole dataset.

In [None]:
unspecified_whole=(df['Borough']=='Unspecified')
print('The percentage of invalid value of the whole dataset is {:10.2f}%'.format(unspecified_whole.sum()/df.shape[0]*100))

The percentage of invalid values is 5.35%. It is acceptable to discard them because we mainly focus on the overall distribution and trend of some selected features.
However, the interactive bar chart presents the distributions of various noise types in the different boroughs in 2019, so particular attention should be paid to the data quality for this year.

In [None]:
unspecified_2019=(df_2019['Borough']=='Unspecified')
print('The percentage of invalid value for 2019 is {:10.2f}%'.format(unspecified_2019.sum()/df_2019.shape[0]*100))

The 2019 data is of better quality than the dataset as a whole (0.91% invalid values in 2019 compared to 5.35% for 2010~2020), which indicates an improvement in data collection by the government.
For our analysis, we discarded the unspecified values; this will not have a great influence on the analysis result.
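The same comparison can be made for every year at once by grouping the 'Unspecified' flag by year. The sketch below uses made-up records and assumes a `year` column like the one derived earlier from the creation time.

```python
import pandas as pd

# Made-up records; the real shares are the ones reported above.
sample = pd.DataFrame({
    'year': [2010, 2010, 2010, 2010, 2019, 2019, 2019, 2019],
    'Borough': ['BRONX', 'Unspecified', 'QUEENS', 'Unspecified',
                'BROOKLYN', 'MANHATTAN', 'Unspecified', 'QUEENS'],
})

# The mean of a boolean flag is the share of True values in each group.
is_unspecified = sample['Borough'] == 'Unspecified'
unspecified_share = is_unspecified.groupby(sample['year']).mean() * 100
print(unspecified_share)
```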

###  Summary of the dataset after cleaning and preprocessing

Because the dataset covers a great number of complaint types, it is necessary to narrow it down to the main ones to obtain the main trends and features of noise in New York City. After data cleaning and preprocessing, the dataset only contains the attributes necessary for the report. The dataset has 22662415 rows and 10 columns (of original attributes).

In [None]:
df.head(10)

#  Data analysis

## The proportion of noise complaints among all cases

In [None]:
# Count the complaints whose type contains 'oise' (matches 'Noise' and 'noise')
count=df['Complaint Type'].str.contains('oise', regex=False).sum()
print('The percentage of noise out of the whole dataset is {:10.2f}%'.format(count/df.shape[0]*100))

##  Sum up main types and sub types.

In [None]:
main_noise=df[df['Complaint Type'].str.contains('oise', regex=False)]
counts=main_noise['Complaint Type'].value_counts()
counts=counts.iloc[0:5,]

In [None]:
plt.figure(figsize=(12,8))
counts.plot(kind='bar')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('The sum of each main type (the 5 most frequent)',fontsize=15)

The most frequent main type is Noise - Residential, which shows that noise cases are mostly reported by residents. Below, we also plot the 15 most frequent subtypes.

In [None]:
# Keep only the 15 most frequent subtypes, as stated in the title
sub_noise=main_noise['Descriptor'].value_counts().iloc[0:15]
plt.figure(figsize=(12,8))
sub_noise.plot(kind='bar')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('The sum of each subtype (the 15 most frequent)',fontsize=15)

## The proportion of the considered noise cases among all noise cases

In [None]:
counts.sum()/count

## Plotting the monthly trend of main types

In [None]:
f,p=plt.subplots(len(Noise_type),figsize=(60,200))
m=0
month_range=np.arange(df['month'].min(),df['month'].max()+1)
month_range_scarce=np.arange(df['month'].min(),df['month'].max()+1,5)
for i in Noise_type:
    monthly=pd.Series(np.zeros(len(month_range)+1),dtype='int32')
    drawn=df[df['Complaint Type']==i]['month'].value_counts()
    print('I am doing ', i)
    for j in drawn.index:
        monthly.loc[j]=drawn[j]
    p[m].bar(month_range,monthly[month_range])
    p[m].set_title(i,size=60)
    p[m].tick_params(axis='x',labelrotation=90)
    p[m].set_ylim(0,1.2*monthly.max(axis=0))
    p[m].tick_params(labelsize=30)
    p[m].set_xticks(month_range)
    m+=1

We observed that all five main noise types show an increasing trend from 2010 to 2020, together with seasonal fluctuation.

We can obtain more information if the monthly trend of each subtype is plotted.
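The monthly counts behind such plots can also be tabulated in one step by grouping on a calendar month and the complaint type. The following is a minimal sketch on made-up records; it assumes the 12-hour timestamp layout with an AM/PM marker.

```python
import pandas as pd

# Made-up records standing in for the noise complaints.
sample = pd.DataFrame({
    'Created Date': ['01/05/2019 11:00:00 PM', '01/20/2019 01:00:00 AM',
                     '02/03/2019 10:30:00 PM', '02/10/2019 09:00:00 AM'],
    'Complaint Type': ['Noise - Residential', 'Noise - Residential',
                       'Noise - Residential', 'Noise - Commercial'],
})

created = pd.to_datetime(sample['Created Date'], format='%m/%d/%Y %I:%M:%S %p')
month = created.dt.to_period('M').rename('month')

# Rows: calendar months; columns: complaint types; values: number of cases.
monthly = (sample.groupby([month, 'Complaint Type'])
                 .size()
                 .unstack(fill_value=0))
print(monthly)
```

A table of this shape can be passed directly to `monthly.plot(kind='bar', subplots=True)` to draw one panel per type.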

## Plotting the monthly trend of sub types

In [None]:
# for i in Noise_type:
#     m=0
#     subtype=Noise_summary[i]['Descriptor'].unique()
#     print('Len of subtype',len(subtype))
#     f,p=plt.subplots(len(subtype),figsize=(60,200))
#     plt.subplots_adjust(hspace = 0.4)
#     for j in subtype:
#         monthly=pd.Series(np.zeros(len(month_range)+1),dtype='int32')
#         drawn=Noise_summary[i][Noise_summary[i]['Descriptor']==j]['month'].value_counts()
#         print('I am doing ',i,j)
#         for k in drawn.index:
#             monthly.loc[k]=drawn[k]
# #        print(monthly[month_range])
#         p[m].bar(month_range,monthly[month_range])
#         p[m].set_title(i+':  '+j,size=60)
#         p[m].tick_params(axis='x',labelrotation=90)
#         p[m].set_ylim(0,1.2*monthly.max(axis=0))
#         p[m].tick_params(labelsize=30)
#         p[m].set_xticks(month_range_scarce)
#         m+=1

m=0
n=0   
f,p=plt.subplots(18,2,figsize=(60,100))
for i in Noise_type:  
    subtype=Noise_summary[i]['Descriptor'].unique()
#    print('Len of subtype',len(subtype))
#     if len(subtype)%2==1:
#         rows=len(subtype)//2+1
#     else:
#         rows=len(subtype)//2
    
    plt.subplots_adjust(hspace = 0.4)
    for j in subtype:
        monthly=pd.Series(np.zeros(len(month_range)+1),dtype='int32')
        drawn=Noise_summary[i][Noise_summary[i]['Descriptor']==j]['month'].value_counts()
#        print('I am doing ',i,j)
        for k in drawn.index:
            monthly.loc[k]=drawn[k]
#        print(monthly[month_range])
#        print(m,n)
        p[m][n].bar(month_range,monthly[month_range])
        p[m][n].set_title(i+':  '+j,size=30)
        p[m][n].tick_params(axis='x',labelrotation=90)
        p[m][n].set_ylim(0,1.2*monthly.max(axis=0))
        p[m][n].tick_params(labelsize=30)
        p[m][n].set_xticks(month_range_scarce)
        n+=1
        if n==2:
            m+=1
            n=0

After the initial analysis, we focus only on the noise subtypes with complete data (available for the whole period from 2010 to 2020). In general, they show a seasonal pattern with more cases in summer and fewer in winter. In addition, we sorted the subtypes into three categories according to their overall trend.
* Ascending trend: most of the subtypes, mostly related to human activity, e.g. Loud Music/Party, Loud Talking.
* Stable: only a few, mostly unrelated to human activity, e.g. Barking Dog.
* Descending trend: only one, Jack Hammering.
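One possible way to make such a grouping reproducible is to fit a straight line to each monthly series and look at the sign and size of its slope. The sketch below is an assumption about how this could be done, not a description of how the categories above were obtained: the helper `classify_trend`, the 5% per year tolerance and the synthetic series are all illustrative.

```python
import numpy as np

def classify_trend(monthly_counts, tolerance=0.05):
    """Label a monthly series by the slope of a least-squares line
    (hypothetical helper). The slope is converted to a yearly change
    relative to the mean monthly count before it is compared with
    `tolerance`."""
    counts = np.asarray(monthly_counts, dtype=float)
    months = np.arange(len(counts))
    slope, _ = np.polyfit(months, counts, deg=1)
    relative_yearly_change = slope * 12 / counts.mean()
    if relative_yearly_change > tolerance:
        return 'ascending'
    if relative_yearly_change < -tolerance:
        return 'descending'
    return 'stable'

# Synthetic ten-year series with a yearly seasonal cycle.
m = np.arange(120)
seasonal = 10 * np.sin(2 * np.pi * m / 12)
rising = 100 + m + seasonal
flat = 100 + seasonal
falling = 220 - m + seasonal

for name, series in [('rising', rising), ('flat', flat), ('falling', falling)]:
    print(name, classify_trend(series))
```

Because the fit spans whole years, the seasonal cycle contributes very little to the estimated slope.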

## Analysis of coordinates distribution

In [None]:
from scipy.stats import gaussian_kde
main_noise=main_noise[~np.isnan(main_noise['Latitude'])]
font=18
# histogram
f,p=plt.subplots(2,1,figsize=(10,8))
f.tight_layout(pad=3.0)
p[0].hist(main_noise['Latitude'],bins=50,alpha=0.75,edgecolor = 'white', linewidth = 1.2)
p[0].tick_params(labelsize=font)
p[0].set_title('Histogram and KDE of Latitude',fontsize=font)
# KDE
density = gaussian_kde(main_noise['Latitude'])
m,n=np.histogram(main_noise['Latitude'],bins=50)
p[1].plot(n,density(n))
p[1].tick_params(labelsize=font)


In [None]:
f,p=plt.subplots(2,1,figsize=(10,8))
f.tight_layout(pad=3.0)
p[0].hist(main_noise['Longitude'],bins=50,alpha=0.75,edgecolor = 'white', linewidth = 1.2)
p[0].tick_params(labelsize=font)
p[0].set_title('Histogram and KDE of Longitude',fontsize=font)
# KDE
density = gaussian_kde(main_noise['Longitude'])
m,n=np.histogram(main_noise['Longitude'],bins=50)
p[1].plot(n,density(n))
p[1].tick_params(labelsize=font)

Based on the histograms, we observed how the coordinates are distributed; the distribution fits the territorial shape of New York City.

## If relevant, talk about your machine learning.

For this project, the focus is on statistical analysis, visualization and storytelling. No machine learning is involved in the analysis, except that we planned to use K-nearest neighbours to correct the missing or invalid values in the attribute 'Incident Zip'. As described in the data cleaning section, KNN cannot be applied, because in most cases the coordinates and the zipcode are missing at the same time, while the other attributes are considered irrelevant.

#  Genre

## Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel & Heer). Why?

For visual narrative, we chose the interactive slideshow, which we thought would be a good way to balance author-driven and reader-driven storytelling. There is an overall time-ordered narrative structure (the slideshow); however, at some points the user can manipulate the interactive visualizations (the interactive map and the interactive bar chart in this project) to see more detailed information, so that the reader can better understand a pattern or extract more relevant information. Readers should also be able to control the reading progression themselves. For highlighting, we use zooming, so that readers can further explore the details that arouse their interest.

## Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel & Heer). Why?

We selected linear ordering in order to form a complete storyline, and we use hover details and selection in the interactive parts. We believe these increase the reader's sense of participation and interactivity while reading. For messaging, headlines, annotations, an introduction and a summary are used. The headlines guide readers to the specific content of the article, while the annotations give readers more descriptive information. The introduction arouses readers' interest and attracts them to read further, while the summary concludes the content and stimulates readers' thinking; together they give readers a complete picture of the whole story.

#  Visualization

## Explain the visualizations you've chosen.

* Interactive choropleth map of the distribution of noise cases across different blocks

It is an interactive choropleth map which shows not only the overall distribution of the reported cases but also detailed information on each block.

The color of a block indicates the number of reported noise cases per hectare in it, and readers can easily get a good understanding of the overall distribution with reference to the color bar.

Besides, when you put your mouse on a marker and click it, you get the zipcode, the block name and the number of cases per hectare.
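The value behind each block's color is a simple density: the number of cases in a zipcode divided by its area in hectares. A minimal sketch of that calculation follows; the area values are hypothetical placeholders, and in practice they would come from the zipcode boundary data used for the map.

```python
import pandas as pd

# Made-up noise cases, one row per case.
cases = pd.DataFrame({'Incident Zip': ['10001', '10001', '10001', '11226', '11226']})

# Hypothetical areas in hectares, indexed by zipcode.
area_ha = pd.Series({'10001': 150.0, '11226': 400.0}, name='area_ha')

# Cases per zipcode, joined with the area and converted to a density.
case_count = cases['Incident Zip'].value_counts().rename('cases')
density = case_count.to_frame().join(area_ha)
density['cases_per_ha'] = density['cases'] / density['area_ha']
print(density)
```

Normalizing by area keeps large zipcodes from dominating the map merely because they cover more ground.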

* Distributions of various noise types in different boroughs

It is an interactive bar chart that shows the distribution of the top ten noise subtypes in the five boroughs of New York.

We selected the top 10 noise subtypes in terms of frequency and calculated the percentage for each borough. The x axis presents the 10 noise subtypes while the y axis shows the percentage for each borough. When the mouse is moved onto a bar, it shows the exact percentage value.
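The percentages for such a chart can be derived with a cross-tabulation. The sketch below uses made-up records and assumes that the percentages are normalized within each borough, so that the subtypes of one borough add up to 100%; normalizing within each subtype instead would only require `normalize='index'`.

```python
import pandas as pd

# Made-up records of noise complaints.
sample = pd.DataFrame({
    'Borough': ['BRONX', 'BRONX', 'BRONX', 'BRONX', 'QUEENS', 'QUEENS'],
    'Descriptor': ['Loud Music/Party', 'Loud Music/Party', 'Loud Music/Party',
                   'Banging/Pounding', 'Loud Music/Party', 'Banging/Pounding'],
})

TOP_N = 10  # number of most frequent subtypes to keep
top_subtypes = sample['Descriptor'].value_counts().index[:TOP_N]
top = sample[sample['Descriptor'].isin(top_subtypes)]

# Rows: subtypes; columns: boroughs; values: percentage within each borough.
share = pd.crosstab(top['Descriptor'], top['Borough'], normalize='columns') * 100
print(share)
```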

## Why are they right for the story you want to tell?

From the interactive choropleth map and bar chart, readers can get not only a general understanding of the problem but also the detailed information that interests them. We also provide our own storyline to tell readers what we have found and want them to know, and we use the necessary supplementary material (images) to help readers understand better. The storyline originates from the phenomena shown in the interactive visualizations. Therefore, we think they are the right tools for the report.

#  Discussion

## What went well?

* The dataset contains outliers and invalid values, but they constitute quite a small proportion (less than 5%) of the data we are concerned with.
* All the code works well, and the results fit both our general understanding of the problem and the relevant information we obtained from the Internet.
* We also found the right visualization tools to present our ideas.

## What could be improved? Why?

* The interactive choropleth map is divided into blocks by zipcode. We observed that the size varies a lot across the blocks. All the data in a large block are aggregated into one value, with the weakness that people cannot observe the distribution inside the large block. We noticed this when we zoomed in on Manhattan and found some small blocks with high density, and then realized that the uneven distribution inside the large blocks was ignored. A heat map can be used to solve this problem, but it cannot provide the detailed information that we wanted to present to the readers. We consider the interactive 

* Our analysis was conducted by finding information that we thought was related to each phenomenon. It explains some things, but in some cases we cannot know whether it is the actual cause. We believe further exploration of some problems is worthwhile, and that it requires more information and more advanced mathematical tools.

* There may be other interesting aspects of the data that deserve to be explored. The heat/hot water problem is the second most frequently reported category, which may also contain some interesting insights. The relationship between different noise types is also worth exploring. But we think these are not very relevant to the storyline of the report.

#  Contribution