# Exploratory data analysis of Cities Data in Python.

## Let us understand how to explore the data in Python.


## Introduction

**What is Exploratory Data Analysis?**

Exploratory Data Analysis (EDA) is the process of understanding a data set by summarizing its main characteristics, often by plotting them. This step is especially important before modeling the data for machine learning. Typical EDA plots include histograms, box plots, scatter plots and many more. Exploring the data often takes a lot of time, but it is through EDA that we can define or refine the problem statement for our data set, which is very important.

**What data are we exploring today?**

Challenge 1: City-Business Collaboration

In order to connect companies and cities together, we need to first understand how the data they each report is aligned or divergent by using data science and text analytics techniques. Then, we can begin to identify where there might be opportunities for collaboration and co-investment for mutual sustainability goals that will ultimately benefit all citizens of the city.

**Required Output**: Use of CDP data and ANY company data, including but not limited to CDP company data:

CDP City Data [LOCATION: City Data MSTeams] Source: public disclosures from North American cities Datapoints: population: 0.5, city emission reduction targets

CDP Company Data: [LOCATION: City Data MSTeams] Source: public disclosures from North American companies Datapoints: targets module C4.1-C4.3; Emissions data module C6.1, 6.3, 6.5, 6.10; Risks and opportunities module C2.2-C2.5

Bonus: enrich your information with publicly available external data sets such as city-level information on electric power users, employers, planned economic investment, business registries, corporate city taxpayers, members of local chambers of commerce, CBCA which is a consortium of CDP, WBCSD and C40



---



## 1. Importing the required libraries for EDA

Below are the libraries that are used in order to perform EDA (Exploratory data analysis) in this tutorial.

In [66]:
import pandas as pd
import numpy as np
import seaborn as sns                       #visualisation
import matplotlib.pyplot as plt             #visualisation
%matplotlib inline     
sns.set(color_codes=True)



---



## 2. Loading the data into the data frame.

Loading the data into a pandas data frame is one of the first and most important steps in EDA. The data set is stored as comma-separated values (CSV), so all we have to do is read the CSV file into a data frame and pandas does the rest. Because the file's encoding is not known in advance, the cell below first uses chardet to guess it.

To load the data set into the notebook, place the data file in the same directory as your Jupyter Notebook file. Then open a *shell* in that directory and run the command "jupyter notebook". After the notebook has started, check in the top right corner that you are using the correct kernel/environment.

In [67]:
import chardet
def find_encoding(fname):
    # Read the raw bytes and let chardet guess the encoding
    with open(fname, 'rb') as f:
        r_file = f.read()
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc

file = "Cities_Data_2017-2019_mb2.csv"
my_encoding = find_encoding(file)
df = pd.read_csv(file, encoding=my_encoding)
# To display the top 5 rows 
df.head(5)

Unnamed: 0,Project Year,Account Number,Account Name,Question Number,Question Name,Row Number,Row Name,Column Number,Column Name,Response Answer,Comments,File Name
0,2017,63999,"City of Miami Beach, FL",0.1,Please give a general description and introdu...,1,,C1,Administrative boundary,City/Municipality,,
1,2017,62864,"City of Lancaster, PA",0.1,Please give a general description and introdu...,1,,C1,Administrative boundary,City/Municipality,,
2,2017,61790,"City of Emeryville, CA",0.1,Please give a general description and introdu...,1,,C1,Administrative boundary,City/Municipality,,
3,2017,58485,Abington Township,0.1,Please give a general description and introdu...,1,,C1,Administrative boundary,City/Municipality,,
4,2017,54102,City of Albany,0.1,Please give a general description and introdu...,1,,C1,Administrative boundary,City/Municipality,,


In [68]:
df.tail(5)                        # To display the bottom 5 rows

Unnamed: 0,Project Year,Account Number,Account Name,Question Number,Question Name,Row Number,Row Name,Column Number,Column Name,Response Answer,Comments,File Name
135506,2019,35894,Ville de Montreal,5.0a,Please provide details of your total city-wide...,1,,12,Please indicate to which sector(s) the target ...,Industrial facilities,,
135507,2019,35894,Ville de Montreal,5.0a,Please provide details of your total city-wide...,1,,13,Does this target align to a requirement from a...,No,,
135508,2019,35894,Ville de Montreal,5.0a,Please provide details of your total city-wide...,1,,14,Please describe your target. If your country h...,Baseline emissions have been updated. Please r...,,
135509,2019,35894,Ville de Montreal,8.0b,Please explain why you do not have a renewable...,1,Please explain,1,Reasoning,Other,,
135510,2019,35894,Ville de Montreal,8.0b,Please explain why you do not have a renewable...,1,Please explain,2,Comment,Montreal's electrical energy source is mostly ...,,




---



## 3. Checking the types of data

Here we check the data types, because numbers are sometimes stored as strings. In that case, we have to convert them to a numeric type before we can plot them.
The object dtype refers to mixed types or variable-length strings. See [Pandas types](https://pbpython.com/pandas_dtypes.html).

In [69]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135511 entries, 0 to 135510
Data columns (total 12 columns):
Project Year       135511 non-null int64
Account Number     135511 non-null int64
Account Name       135509 non-null object
Question Number    135511 non-null object
Question Name      135510 non-null object
Row Number         135511 non-null int64
Row Name           8932 non-null object
Column Number      132005 non-null object
Column Name        130376 non-null object
Response Answer    106365 non-null object
Comments           4012 non-null object
File Name          371 non-null object
dtypes: int64(3), object(9)
memory usage: 12.4+ MB
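
If a column that should be numeric shows up as `object`, one way to convert it is `pd.to_numeric`. The sketch below uses a small made-up Series rather than the CDP data, and `errors="coerce"` is an assumption about how unparseable entries should be handled (they become `NaN` instead of raising an error).

```python
import pandas as pd

# Hypothetical answers stored as text, as they might appear in an object column
answers = pd.Series(["120", "45.5", "Not applicable", None], dtype="object")

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
numeric = pd.to_numeric(answers, errors="coerce")

print(answers.dtype)   # dtype before conversion
print(numeric.dtype)   # dtype after conversion
print(numeric.isna().sum(), "values are missing after conversion")
```

Counting the `NaN` values afterwards is a quick way to see how much of a column was not actually numeric.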




---



## Cleaning Up the Data

### Dropping and Renaming Columns

This step is needed in almost every EDA, because a data set often contains columns that we never use, and dropping them is the simplest solution. In this case, the Row Number and Column Number columns are not meaningful for this analysis, so I dropped them here.

In [70]:
df = df.drop(["Row Number", "Column Number"], axis=1)
df.tail(5)

Unnamed: 0,Project Year,Account Number,Account Name,Question Number,Question Name,Row Name,Column Name,Response Answer,Comments,File Name
135506,2019,35894,Ville de Montreal,5.0a,Please provide details of your total city-wide...,,Please indicate to which sector(s) the target ...,Industrial facilities,,
135507,2019,35894,Ville de Montreal,5.0a,Please provide details of your total city-wide...,,Does this target align to a requirement from a...,No,,
135508,2019,35894,Ville de Montreal,5.0a,Please provide details of your total city-wide...,,Please describe your target. If your country h...,Baseline emissions have been updated. Please r...,,
135509,2019,35894,Ville de Montreal,8.0b,Please explain why you do not have a renewable...,Please explain,Reasoning,Other,,
135510,2019,35894,Ville de Montreal,8.0b,Please explain why you do not have a renewable...,Please explain,Comment,Montreal's electrical energy source is mostly ...,,
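
Before dropping columns, it can also help to see how sparsely each one is filled; the `df.info()` output above already shows that `File Name` has only 371 non-null values. The following is a minimal sketch on a made-up frame, and the 90% threshold is purely an illustrative assumption, not a rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    "Account Name": ["A", "B", "C", "D"],
    "Comments": [np.nan, "note", np.nan, np.nan],
    "File Name": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical rule: flag columns that are more than 90% empty
mostly_empty = missing_share[missing_share > 0.9].index.tolist()
print(mostly_empty)
```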




---



In [71]:
df = df.rename(columns={"Account Number": "Account ID", "Question Number" : "Question ID"})
df.head(5)

Unnamed: 0,Project Year,Account ID,Account Name,Question ID,Question Name,Row Name,Column Name,Response Answer,Comments,File Name
0,2017,63999,"City of Miami Beach, FL",0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,
1,2017,62864,"City of Lancaster, PA",0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,
2,2017,61790,"City of Emeryville, CA",0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,
3,2017,58485,Abington Township,0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,
4,2017,54102,City of Albany,0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,




---



### Dropping the duplicate rows

This is often a handy thing to do because duplicate rows often don't add useful information to a dataset.

In [72]:
duplicate_rows_df = df[df.duplicated()]
print("number of duplicate rows: ", duplicate_rows_df.shape)
df.count()      # Counts the non-null values in each column

number of duplicate rows:  (47474, 10)


Project Year       135511
Account ID         135511
Account Name       135509
Question ID        135511
Question Name      135510
Row Name             8932
Column Name        130376
Response Answer    106365
Comments             4012
File Name             371
dtype: int64

In [73]:
df = df.drop_duplicates()
df.head(5)

Unnamed: 0,Project Year,Account ID,Account Name,Question ID,Question Name,Row Name,Column Name,Response Answer,Comments,File Name
0,2017,63999,"City of Miami Beach, FL",0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,
1,2017,62864,"City of Lancaster, PA",0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,
2,2017,61790,"City of Emeryville, CA",0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,
3,2017,58485,Abington Township,0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,
4,2017,54102,City of Albany,0.1,Please give a general description and introdu...,,Administrative boundary,City/Municipality,,


In [74]:
duplicate_rows_df = df[df.duplicated()]
print("NEW number of duplicate rows: ", duplicate_rows_df.shape)
df.count()      # Counts the non-null values in each column

NEW number of duplicate rows:  (0, 10)


Project Year       88037
Account ID         88037
Account Name       88035
Question ID        88037
Question Name      88036
Row Name            8932
Column Name        82902
Response Answer    74198
Comments            2536
File Name            368
dtype: int64
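
The same checks can be written a little more compactly. The sketch below runs on a made-up frame (the column names mirror the data set, the values do not) and shows `duplicated().sum()` for counting and `keep=False` for inspecting every copy of a duplicated row before deciding to drop anything.

```python
import pandas as pd

# Toy frame with one exact duplicate (rows 0 and 1)
toy = pd.DataFrame({
    "Account ID": [1, 1, 2, 2, 3],
    "Question ID": ["0.1", "0.1", "0.1", "0.2", "0.1"],
    "Response Answer": ["City", "City", "County", "County", "City"],
})

# duplicated() marks every repeat after the first occurrence
n_duplicates = int(toy.duplicated().sum())

# keep=False marks all members of a duplicated group, which is useful for inspection
all_copies = toy[toy.duplicated(keep=False)]

# Drop the repeats and renumber the index
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(all_copies), len(deduped))
```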



---



## Exploring the Data


### Find and Transform Unique Data
Once the unique values are found, the data can be transformed into a form that makes more sense to process and analyse.

In [116]:
# Count the unique values in each column and keep them for later inspection
print("Unique Values:")
uniques = {}
for col in df.columns:
    uniques[col] = pd.unique(df[col])
    print(len(uniques[col]), "\t", col)

Unique Values:
3 	 Project Year
219 	 Account ID
224 	 Account Name
81 	 Question ID
101 	 Question Name
30 	 Row Name
252 	 Column Name
20905 	 Response Answer
227 	 Comments
328 	 File Name
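
For the counts alone, pandas also offers `DataFrame.nunique`. Note that `pd.unique` keeps `NaN` as a value, so the counts above include it; `nunique` excludes `NaN` unless `dropna=False` is passed. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame with a missing value in one column
toy = pd.DataFrame({
    "Project Year": [2017, 2018, 2018, 2019],
    "Comments": [np.nan, "a", np.nan, "b"],
})

# dropna=False counts NaN as its own value, matching len(pd.unique(...))
with_nan = toy.nunique(dropna=False)

# The default ignores NaN
without_nan = toy.nunique()

print(with_nan)
print(without_nan)
```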


In [122]:
# Print a sample overview of the data
for col in uniques.keys():
    print(col, "\n", uniques[col][:3], "...")


Project Year 
 [2017 2018 2019] ...
Account ID 
 [63999 62864 61790] ...
Question ID 
 ['0.1' '0.2' '0.3'] ...
Account Name 
 ['City of Miami Beach, FL' 'City of Lancaster, PA'
 'City of Emeryville, CA'] ...
Column Name 
 [' Administrative boundary' ' Description of city'
 ' Government , Community, None'] ...
Comments 
 [nan
 'Arlington County Board members (5 elected at-large) serve 4-year terms. The Chairmanship rotates among the five elected Board members, a new Chair every January 1 for the following 12 months.'
 'Strategies have been defined in the greenhouse gas inventory report to reduce emissions County-wide'] ...
File Name 
 [nan ' vulnerability-assessment (1).pdf'
 ' BrowardCAPReport2015_FINAL DRAFT_01252016.pdf'] ...
Response Answer 
 ['City/Municipality' 'County' 'Other: Consolidated City County Government'] ...
Row Name 
 [nan 'City boundary' 'Please complete'] ...
Question Name 
 [' Please give a general description and introduction to your city including your city\x92s b
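
Two things stand out in this sample: several values start with a space, and the question text contains `\x92` where an apostrophe is expected. The latter pattern typically appears when Windows-1252 text has been decoded as Latin-1, although that is an assumption about this file rather than something the notebook confirms. The sketch below shows one possible transformation; `repair_text` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def repair_text(value):
    """Reinterpret Latin-1-decoded text as Windows-1252 and trim whitespace."""
    if not isinstance(value, str):
        return value  # leave NaN and other non-strings untouched
    try:
        value = value.encode("latin-1").decode("cp1252")
    except UnicodeError:
        pass  # cannot be reinterpreted this way; keep the text unchanged
    return value.strip()

# Made-up sample that mimics the values printed above
raw = pd.Series([" your city\x92s boundary ", " Administrative boundary", None])
cleaned = raw.map(repair_text)
print(cleaned.tolist())
```

On the real data this could be applied per text column, for example with `df["Question Name"].map(repair_text)`, after checking a few results by eye.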

In [129]:
# Zoom in on the unique values of a particular column
col = "Question Name"
items = 10
print(col, items, "/", len(uniques[col]), "\n")
print(uniques[col][:items])


Question Name 10 / 101 

[' Please give a general description and introduction to your city including your city\x92s boundary in the text box below. '
 ' Emissions Accounting ChoiceReporting emissions is optional for all cities. By checking the boxes below you are indicating that you have fuel and/or greenhouse gas (GHG) emissions data to report at this time.\xa0Select \x91Government\x92 to report emissions from your local government operations (sometimes referred to as \x91corporate\x92 or \x91municipal\x92 emissions).Select \x91Community\x92 to report emissions from the entire city area over which the city government can exercise a degree of influence through the policies and regulations they implement (sometimes referred to as \x91geographic\x92 or \x91city wide\x92 emissions).Select both boxes to report fuel and/or emissions for both inventories.\xa0IF YOU HAVE NO FUEL AND/OR GREENHOUSE GAS EMISSIONS TO REPORT DO NOT CHECK EITHER BOX.'
 " Please provide information about your city'

## Detecting Outliers

An outlier is a point, or a set of points, that differs markedly from the rest of the data; its value can be unusually high or unusually low. It is often a good idea to detect outliers and consider removing them, because they are a common cause of less accurate models.
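
One common convention for flagging outliers in a numeric column is the interquartile range (IQR) rule. The sketch below is only an illustration on made-up numbers: the cities data is mostly text, so a column would first have to be converted to a numeric type, and the 1.5 multiplier is a convention rather than a requirement.

```python
import pandas as pd

# Hypothetical numeric column, e.g. reported values after conversion to numbers
values = pd.Series([52, 48, 61, 55, 47, 58, 50, 900])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points further than 1.5 * IQR outside the quartiles are flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(values[is_outlier].tolist())
```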

### Histogram

A histogram shows how often the values of a variable fall within each interval (bin).

In [76]:
# df.Make.value_counts().nlargest(40).plot(kind='bar', figsize=(10,5))
# plt.title("Number of cars by make")
# plt.ylabel('Number of cars')
# plt.xlabel('Make');
pass
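
The commented-out cell above refers to a `Make` column, which does not exist in the cities data. A comparable bar chart of counts for this data set might look like the sketch below. It uses a toy frame so that it runs on its own; the choice of `Project Year` and the axis labels are assumptions made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the cleaned frame; the real df has a column with the same name
toy = pd.DataFrame({"Project Year": [2017, 2017, 2018, 2019, 2019, 2019]})

# Count rows per year and keep the years in chronological order
counts = toy["Project Year"].value_counts().sort_index()

# Draw the counts as a bar chart
ax = counts.plot(kind="bar", figsize=(10, 5))
ax.set_title("Number of rows by project year")
ax.set_xlabel("Project Year")
ax.set_ylabel("Number of rows")
plt.tight_layout()
```

Replacing `toy` with `df` should give the same kind of chart for the full data set.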

**These are some of the general steps involved in exploratory data analysis. There are many more, but this should be enough to give you an idea of how to perform a good EDA on any data set. Stay tuned for more updates.**

## Thank you.