## Cleaning Data

In the Web Scraping notebook, I scraped the data and combined it into a CSV file named 'Data1.csv'.
However, this dataset is not in the most useful format and must be cleaned before any analysis can be performed.

Here, I have removed unnecessary strings and values from the dataset.

##### Importing Libraries

In [1]:
import re
import numpy as np
import pandas as pd

Reading the dataset using pd.read_csv()

In [2]:
dataset = pd.read_csv("Data1.csv")
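
The 'Unnamed: 0' column that shows up in the outputs of this notebook is typically the row index written by an earlier `to_csv()` call. As a minimal sketch, assuming that is the case here, passing `index_col=0` to `pd.read_csv()` would read it back as the index instead of as a data column. The two-row `raw` string below is a hypothetical stand-in for 'Data1.csv', not the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for Data1.csv: the empty first header cell is what
# to_csv() writes when the default integer index is saved along with the data.
raw = ",2018 rank,City\n0,1,New York City\n1,2,Los Angeles\n"

# Default read: the saved index becomes a regular column named 'Unnamed: 0'.
plain = pd.read_csv(io.StringIO(raw))

# With index_col=0 the first column is used as the DataFrame index instead.
indexed = pd.read_csv(io.StringIO(raw), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```

Whether to do this depends on whether the saved index is useful later; this notebook keeps the column as it is.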

Looking at each column, I found that the City column contains square-bracketed markers, which were footnote links on the Wikipedia page. We don't need them here, so I remove them.

In [7]:
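# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
dataset['City'] = dataset['City'].astype(str).str.replace(r"\[.*?\]", "", regex=True)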
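
To check the bracket removal in isolation, here is a small sketch on made-up city names (the `sample` frame and its footnote letters are hypothetical, not taken from 'Data1.csv'). It uses a pattern that matches one bracketed marker at a time, so a name carrying several markers is still cleaned correctly.

```python
import pandas as pd

# Hypothetical sample mimicking scraped city names with footnote markers.
sample = pd.DataFrame({"City": ["New York City[d]", "Los Angeles", "Honolulu[f][g]"]})

# \[[^\]]*\] matches a single [...] marker; regex=True makes pandas treat
# the pattern as a regular expression rather than a literal string.
sample["City"] = (
    sample["City"]
    .str.replace(r"\[[^\]]*\]", "", regex=True)
    .str.strip()
)

print(sample["City"].tolist())
```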

The 'Change' column stores percentage change values, so I rename it for clarity.

In [4]:
dataset.rename(columns={'Change': 'Percentage Change'}, inplace = True)

Since the header now states that the values are percentage changes, the % sign in each value is redundant and not needed for analysis. Therefore, I strip '%' from that column.

In [5]:
dataset['Percentage Change'] = dataset['Percentage Change'].str.strip('%')
dataset['Percentage Change'] = dataset['Percentage Change'].str.strip()
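
After stripping, the values are still strings. Some of them (for example the Anchorage row further down) use the Unicode minus sign '−' rather than an ASCII hyphen, which a plain float conversion would reject. The following is a minimal sketch of one way to get numeric values, run on hypothetical sample strings rather than on the real column.

```python
import pandas as pd

# Hypothetical sample values; "\u2212" is the Unicode minus sign.
pct = pd.Series(["+18.31%", "\u22120.10%", " 2.74% "])

cleaned = (
    pct.str.strip()                             # drop surrounding whitespace
       .str.rstrip("%")                         # drop the trailing percent sign
       .str.replace("\u2212", "-", regex=False) # normalise Unicode minus to ASCII
)

# Convert the cleaned strings to floats.
pct_numeric = cleaned.astype(float)

print(pct_numeric.tolist())
```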

In [8]:
dataset.head()

Unnamed: 0,2018 rank,City,State,2018 estimate,2010 census,Percentage Change,2016 land area km2,2016 population density per km2,Location,Time zone,Website
0,1,New York City,New York,8398748,8175133,2.74,780.9 km2,"10,933/km2",40°39′49″N 73°56′19″W﻿ / ﻿40.6635°N 73.9387°W﻿...,UTC−05:00 (EST),https://www.nyc.gov/
1,2,Los Angeles,California,3990456,3792621,5.22,"1,213.9 km2","3,276/km2",34°01′10″N 118°24′39″W﻿ / ﻿34.0194°N 118.4108°...,UTC−08:00 (Pacific),https://www.lacity.org/
2,3,Chicago,Illinois,2705994,2695598,0.39,588.7 km2,"4,600/km2",41°50′15″N 87°40′54″W﻿ / ﻿41.8376°N 87.6818°W﻿...,UTC−06:00 (Central),http://www.cityofchicago.org
3,4,Houston,Texas,2325502,2100263,10.72,"1,651.1 km2","1,395/km2",29°47′12″N 95°23′27″W﻿ / ﻿29.7866°N 95.3909°W﻿...,UTC−6 (CST),http://www.houstontx.gov/
4,5,Phoenix,Arizona,1660272,1445632,14.85,"1,340.6 km2","1,200/km2",33°34′20″N 112°05′24″W﻿ / ﻿33.5722°N 112.0901°...,UTC−7 (MST (no DST)),http://www.phoenix.gov


Here, the 2018 estimate and 2010 census columns contain ',' as a thousands separator, which is not required, so I remove it.

In [9]:
dataset['2018 estimate'] = dataset['2018 estimate'].str.replace(',', '')
dataset['2010 census'] = dataset['2010 census'].str.replace(',', '')
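
Removing the commas leaves the population figures as text. As a sketch on hypothetical sample rows, the same step can be followed by a conversion to integers. The `astype(str)` call is an assumption added for safety: it keeps the code from failing if pandas has already parsed a column as numeric.

```python
import pandas as pd

# Hypothetical sample rows with thousands separators.
sample = pd.DataFrame({
    "2018 estimate": ["8,398,748", "168,122"],
    "2010 census": ["8,175,133", "159,498"],
})

for col in ["2018 estimate", "2010 census"]:
    sample[col] = (
        sample[col]
        .astype(str)                          # no-op for strings, guards numeric columns
        .str.replace(",", "", regex=False)    # remove the thousands separator
        .astype(int)                          # store as integers
    )

print(sample.dtypes)
```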


In [10]:
dataset.sample(2)

Unnamed: 0,2018 rank,City,State,2018 estimate,2010 census,Percentage Change,2016 land area km2,2016 population density per km2,Location,Time zone,Website
155,156,Springfield,Missouri,168122,159498,5.41,213.2 km2,785/km2,37°11′39″N 93°17′29″W﻿ / ﻿37.1942°N 93.2913°W﻿...,UTC−6 (CST),http://www.springfieldmo.gov/
85,86,Reno,Nevada,250998,225221,11.45,277.9 km2,883/km2,39°32′57″N 119°51′00″W﻿ / ﻿39.5491°N 119.8499°...,UTC−8 (Pacific (PST)),http://reno.gov


In the same way, 2016 land area km2 and 2016 population density per km2 carry a redundant unit suffix, so I remove it.

In [11]:
dataset['2016 land area km2'] = dataset['2016 land area km2'].str.replace('km2', '')
dataset['2016 population density per km2'] = dataset['2016 population density per km2'].str.replace('/km2', '')
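
Removing only the unit text can leave a trailing space and thousands separators behind (for example '1,213.9 km2' becomes '1,213.9 '). The helper below, `to_number`, is a hypothetical function written for this illustration: it strips the unit, the commas, and the whitespace, then converts to float. The sample values are made up to match the format seen in the scraped table.

```python
import pandas as pd

# Hypothetical sample rows in the scraped format.
sample = pd.DataFrame({
    "2016 land area km2": ["780.9 km2", "1,213.9 km2"],
    "2016 population density per km2": ["10,933/km2", "3,276/km2"],
})

def to_number(series, unit):
    """Strip a literal unit suffix and thousands separators, then parse as float."""
    return (
        series.str.replace(unit, "", regex=False)
              .str.replace(",", "", regex=False)
              .str.strip()
              .astype(float)
    )

area = to_number(sample["2016 land area km2"], "km2")
density = to_number(sample["2016 population density per km2"], "/km2")

print(area.tolist())
print(density.tolist())
```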

In [12]:
dataset.sample(4)

Unnamed: 0,2018 rank,City,State,2018 estimate,2010 census,Percentage Change,2016 land area km2,2016 population density per km2,Location,Time zone,Website
280,281,Hillsboro,Oregon,108389,91611,+18.31,64.7,1624,45°31′41″N 122°56′09″W﻿ / ﻿45.5280°N 122.9357°...,UTC−8 (PST),http://www.hillsboro-oregon.gov/
67,68,Anchorage,Alaska,291538,291826,−0.10,4420.1,68,61°10′27″N 149°17′03″W﻿ / ﻿61.1743°N 149.2843°...,UTC-9 (AKST),http://www.muni.org/
162,163,Clarksville,Tennessee,156794,132929,+17.95,254.6,590,36°33′59″N 87°20′43″W﻿ / ﻿36.5664°N 87.3452°W﻿...,UTC−6 (CST),http://www.cityofclarksville.com/
139,140,Sioux Falls,South Dakota,181883,153888,+18.19,195.3,893,43°32′18″N 96°43′55″W﻿ / ﻿43.5383°N 96.7320°W﻿...,UTC−6 (Central),http://www.SiouxFalls.org


Here, the Location data has been fetched, but it might not be needed for comparison or any particular analysis, so I drop that column as well.

In [13]:
drop_columns = ['Location']
dataset.drop(drop_columns, axis = 1, inplace = True)

Final cleaned dataset

In [14]:
dataset.to_csv("Final_data.csv", index = False)

In [15]:
dataset

Unnamed: 0,2018 rank,City,State,2018 estimate,2010 census,Percentage Change,2016 land area km2,2016 population density per km2,Time zone,Website
0,1,New York City,New York,8398748,8175133,+2.74,780.9,10933,UTC−05:00 (EST),https://www.nyc.gov/
1,2,Los Angeles,California,3990456,3792621,+5.22,1213.9,3276,UTC−08:00 (Pacific),https://www.lacity.org/
2,3,Chicago,Illinois,2705994,2695598,+0.39,588.7,4600,UTC−06:00 (Central),http://www.cityofchicago.org
3,4,Houston,Texas,2325502,2100263,+10.72,1651.1,1395,UTC−6 (CST),http://www.houstontx.gov/
4,5,Phoenix,Arizona,1660272,1445632,+14.85,1340.6,1200,UTC−7 (MST (no DST)),http://www.phoenix.gov
5,6,Philadelphia,Pennsylvania,1584138,1526006,+3.81,347.6,4511,UTC-5 (EST),http://www.phila.gov/
6,7,San Antonio,Texas,1532233,1327407,+15.43,1194.0,1250,UTC−6 (CST),http://www.sanantonio.gov/
7,8,San Diego,California,1425976,1307402,+9.07,842.3,1670,UTC−8 (Pacific),http://www.sandiego.gov/
8,9,Dallas,Texas,1345047,1197816,+12.29,882.9,1493,UTC−6 (Central),http://www.dallascityhall.com/
9,10,San Jose,California,1030119,945942,+8.90,459.7,2231,UTC−8 (Pacific Time Zone),http://www.sanjoseca.gov/
