# Segmenting and Clustering Neighborhoods in Atlanta

## 1. Introduction/Business Problem 

An entrepreneur who owns a fast food restaurant wants to open a second branch in Atlanta. To maximize profit, the new branch should be located in a densely populated neighborhood with little competition in the fast food sector. To find a suitable location, he turns to a consulting firm, where I work as a data analyst in the information technology department. In this project, I will analyze the available data and look for the most effective solution for our customer using the k-means clustering algorithm. First, in the data processing part of the project, I will clean the data and determine the 10 most populous neighborhoods of Atlanta. Next, I will visualize the data with the Folium library and analyze it.

## 2. Data Section

In this section, I will build the code to scrape the following Wikipedia page: https://en.wikipedia.org/wiki/Table_of_Atlanta_neighborhoods_by_population

In [1]:
pip install wikipedia

Note: you may need to restart the kernel to use updated packages.


### Transform the data into a pandas dataframe

In [2]:
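from io import StringIO

import pandas as pd
import wikipedia as wp

# Get the HTML source of the Wikipedia page
html = wp.page("Table of Atlanta neighborhoods by population").html()
# The first table on the page holds the neighborhood data
neighborhoods_Atlanta = pd.read_html(StringIO(html))[0]
# Save a copy to disk, keeping the column names as the header row
neighborhoods_Atlanta.to_csv('beautifulsoup_pandas.csv', header=True, index=False)
print(neighborhoods_Atlanta)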

              Neighborhood  Population (2010) NPU
0               Adair Park               1331   V
1               Adams Park               1763   R
2               Adamsville               2403   H
3              Almond Park               1020   G
4              Ansley Park               2277   E
..                     ...                ...  ..
156       Westwood Terrace                733   I
157  Whittier Mill Village                617   D
158               Wildwood               1840   C
159    Wilson Mill Meadows               1096   H
160       Wisteria Gardens                512   H

[161 rows x 3 columns]
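
To make the scraping step less opaque, the sketch below parses a tiny hand-written HTML table with Python's standard-library `html.parser`. This is only a rough illustration of the kind of work a table reader has to do before a data frame can be built. The HTML snippet and the `TableParser` class are my own illustrative assumptions and are not part of the notebook; the three rows are copied from the output above.

```python
from html.parser import HTMLParser

# Tiny stand-in for the Wikipedia table (assumption: the real page has far more markup).
SAMPLE_HTML = """
<table>
  <tr><th>Neighborhood</th><th>Population (2010)</th><th>NPU</th></tr>
  <tr><td>Adair Park</td><td>1331</td><td>V</td></tr>
  <tr><td>Adams Park</td><td>1763</td><td>R</td></tr>
  <tr><td>Adamsville</td><td>2403</td><td>H</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Hypothetical helper: collects the cell text of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # row currently being built, or None outside <tr>
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text between tags is only kept when it belongs to a cell.
        if self._in_cell:
            self._row[-1] += data.strip()


parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
print(header)
print(records)
```

The first parsed row plays the role of the column names and the remaining rows are the records, all still as strings; converting the population column to integers would be a separate step.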


In [3]:
neighborhoods_Atlanta.head(10)

|  | Neighborhood | Population (2010) | NPU |
|---|---|---|---|
| 0 | Adair Park | 1331 | V |
| 1 | Adams Park | 1763 | R |
| 2 | Adamsville | 2403 | H |
| 3 | Almond Park | 1020 | G |
| 4 | Ansley Park | 2277 | E |
| 5 | Ardmore | 756 | E |
| 6 | Argonne Forest | 590 | C |
| 7 | Arlington Estates | 776 | P |
| 8 | Ashview Heights | 1292 | T |
| 9 | Atlanta University Center | 5703 | T |


I will remove the 'NPU' column using the 'drop' function because I don't need this information.

In [4]:
neighborhoods_Atlanta.drop(columns=['NPU'], inplace=True)

In [5]:
# Compare the number of distinct neighborhood names with the number of rows
unique_count = len(neighborhoods_Atlanta['Neighborhood'].unique())
if neighborhoods_Atlanta.shape[0] == unique_count:
    print('Neighborhood is OK, none of its values is doubled')
else:
    print('some incongruences found, please check consistency')

Neighborhood is OK, none of its values is doubled
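
If this check ever fails, it helps to see which names are repeated rather than only learning that something is wrong. The following is a minimal sketch, assuming a data frame shaped like `neighborhoods_Atlanta`; the `sample` frame and its repeated row are made up for illustration.

```python
import pandas as pd

# Made-up sample with one deliberately repeated neighborhood name.
sample = pd.DataFrame({
    "Neighborhood": ["Adair Park", "Adams Park", "Adair Park"],
    "Population (2010)": [1331, 1763, 1331],
})

# keep=False marks every occurrence of a repeated name, not only the later ones.
mask = sample["Neighborhood"].duplicated(keep=False)
repeated = sample.loc[mask, "Neighborhood"].unique().tolist()
print(repeated)

# Keep only the first occurrence of each name.
deduplicated = sample.drop_duplicates(subset="Neighborhood", keep="first")
print(len(deduplicated))
```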


In [6]:
neighborhoods_Atlanta.shape

(161, 2)

I need the 10 most populous neighborhoods of Atlanta, so I will sort the neighborhood data by population in descending order.

In [7]:
neighborhoods_Atlanta_firstten=neighborhoods_Atlanta.sort_values(by='Population (2010)', ascending=False)

In [8]:
neighborhoods_Atlanta_firstten.head(10)

|  | Neighborhood | Population (2010) |
|---|---|---|
| 95 | Midtown | 16569 |
| 51 | Downtown | 13411 |
| 104 | Old Fourth Ward | 10505 |
| 101 | North Buckhead | 8270 |
| 119 | Pine Hills | 8033 |
| 98 | Morningside/Lenox Park | 8030 |
| 149 | Virginia-Highland | 7800 |
| 66 | Grant Park | 6771 |
| 64 | Georgia Tech | 6607 |
| 80 | Kirkwood | 5897 |
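
Sorting the whole table and then taking the head works well. As an alternative, `DataFrame.nlargest` selects the top rows in a single call. Here is a small sketch using five rows copied from the outputs in this section; the `sample` and `top3` names are illustrative only.

```python
import pandas as pd

# Five rows copied from the tables in this section.
sample = pd.DataFrame({
    "Neighborhood": ["Adair Park", "Midtown", "Kirkwood", "Downtown", "Ardmore"],
    "Population (2010)": [1331, 16569, 5897, 13411, 756],
})

# Select the three most populous rows, largest first.
top3 = sample.nlargest(3, "Population (2010)")
print(top3["Neighborhood"].tolist())
```

Like the sorted frame, the result keeps the original row labels, so the index shows where each neighborhood sat in the full table.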


I manually created a CSV file containing the 10 most populous neighborhoods together with their coordinates.

In [9]:
# pandas is already imported as pd in In [2]
df_coordinates = pd.read_csv('neigborhoods_atlanta_coordinates.csv')
print(df_coordinates)

             Neighborhood  Population (2010)   Latitude  Longitude
0                 Midtown              16569  33.783020 -84.382332
1                Downtown              13411  33.921520 -84.381912
2         Old Fourth Ward              10505  33.766430 -84.370407
3          North Buckhead               8270  33.852700 -84.365400
4              Pine Hills               8033  33.838715 -84.350830
5  Morningside/Lenox Park               8030  33.796200 -84.359500
6       Virginia-Highland               7800  33.781700 -84.363500
7              Grant Park               6771  33.737200 -84.368200
8            Georgia Tech               6607  33.775600 -84.396300
9                Kirkwood               5897  33.753300 -84.326200
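
The coordinate file was assembled by hand. One way to attach such coordinates to the population table programmatically is a left merge on the neighborhood name, sketched below. The two small frames are illustrative; their values are copied from the tables in this section, and the Kirkwood coordinates are left out on purpose to show how an unmatched row surfaces.

```python
import pandas as pd

# Population rows copied from the sorted table.
population = pd.DataFrame({
    "Neighborhood": ["Midtown", "Old Fourth Ward", "Kirkwood"],
    "Population (2010)": [16569, 10505, 5897],
})

# Coordinates copied from the CSV output; Kirkwood is deliberately omitted.
coordinates = pd.DataFrame({
    "Neighborhood": ["Midtown", "Old Fourth Ward"],
    "Latitude": [33.783020, 33.766430],
    "Longitude": [-84.382332, -84.370407],
})

# A left merge keeps every population row; validate guards against repeated names.
merged = population.merge(
    coordinates, on="Neighborhood", how="left", validate="one_to_one"
)

# Rows without a match end up with NaN coordinates.
missing = merged.loc[merged["Latitude"].isna(), "Neighborhood"].tolist()
print(missing)
```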


In [10]:
df_coordinates.head(10)

|  | Neighborhood | Population (2010) | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Midtown | 16569 | 33.78302 | -84.382332 |
| 1 | Downtown | 13411 | 33.92152 | -84.381912 |
| 2 | Old Fourth Ward | 10505 | 33.76643 | -84.370407 |
| 3 | North Buckhead | 8270 | 33.8527 | -84.3654 |
| 4 | Pine Hills | 8033 | 33.838715 | -84.35083 |
| 5 | Morningside/Lenox Park | 8030 | 33.7962 | -84.3595 |
| 6 | Virginia-Highland | 7800 | 33.7817 | -84.3635 |
| 7 | Grant Park | 6771 | 33.7372 | -84.3682 |
| 8 | Georgia Tech | 6607 | 33.7756 | -84.3963 |
| 9 | Kirkwood | 5897 | 33.7533 | -84.3262 |
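
Hand-entered coordinates are easy to mistype, so a quick range check is worthwhile before mapping them. The sketch below flags any point outside a rough bounding box around the city of Atlanta. The box limits are my own approximate assumption, not values from the data source, and should be adjusted as needed; the coordinates are copied from the table above.

```python
# Approximate bounding box for the city of Atlanta (assumed values; adjust as needed).
LAT_MIN, LAT_MAX = 33.64, 33.89
LON_MIN, LON_MAX = -84.56, -84.28

# (latitude, longitude) pairs copied from the coordinates table.
points = {
    "Midtown": (33.783020, -84.382332),
    "Downtown": (33.921520, -84.381912),
    "Old Fourth Ward": (33.766430, -84.370407),
    "North Buckhead": (33.852700, -84.365400),
    "Pine Hills": (33.838715, -84.350830),
    "Morningside/Lenox Park": (33.796200, -84.359500),
    "Virginia-Highland": (33.781700, -84.363500),
    "Grant Park": (33.737200, -84.368200),
    "Georgia Tech": (33.775600, -84.396300),
    "Kirkwood": (33.753300, -84.326200),
}


def outside_box(lat, lon):
    """Return True when the point lies outside the assumed bounding box."""
    return not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX)


suspicious = [name for name, (lat, lon) in points.items() if outside_box(lat, lon)]
print(suspicious)
```

With these assumed limits, only Downtown is flagged, because its latitude of 33.92152 lies north of the box. That entry may deserve a second look in the CSV file before it is placed on a map.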


After processing, the data I need is gathered in a single table and is ready for visualization and analysis. In the methodology section, I will draw more detailed inferences from this data frame.