##### IMP-PCBA

# Required discussion 2.6: A day in the life of a business analyst
---
**Expected time: 3 hours**

## Overview

Imagine you are a business analyst who has been asked to identify a prime location for a takeaway burrito restaurant in London. You want to pick a location that will maximise profits. You know there are many factors to consider, including competition from other restaurants and takeaways, household income and population density. It may seem that there is too much to weigh up; however, as a business analyst, you know that you can follow a set procedure to reach a sound decision about where to place your business in London.

The following Jupyter Notebook will take you through this procedure one step at a time.

Finding the best location for a business is a complex task and requires some information that is often only commercially available. To keep things simple, you will only use publicly available data sets. You will also only interact with cleaned data frames for now. In reality, business analysts often spend about 70 per cent of their time finding, cleaning and reformatting the data and engineering features from it. The purpose of this activity, however, is simply to give you a glimpse into how a business analyst may tackle a project and the tools they would use.

In addition to finding the best place to start your business, you also want to produce the most popular burritos in London. To this end, you are going to develop a classification machine learning model that will predict which types of ingredients and other factors are important in creating a ‘great’ burrito. In order to do this, you will have to use pre-existing data sets available on the internet.

Throughout this Jupyter Notebook, you will be encouraged to answer questions about what you think the data is showing you and what you might want to do next. Do your best to examine the data frames and plots and draw inferences from what you see. The steps you will be shown are intended to show you the general flow of a data science project.

### Learning outcome addressed
- Describe the role of a data analyst.


## Selecting a takeaway restaurant location

To begin, you know you will have to find all the relevant data that you think will help you in deciding on a location for your burrito takeaway restaurant. There are many sources for data on restaurant types and numbers in the different boroughs of London, but some are commercial and cost money to obtain. So, to start with, before going down the commercial path, you decide to search through government websites for London. 

The first thing you want to do is get a visual of the London boroughs, so that you have a feel for where each borough is and can refer back to the map during the analysis.

![London Boroughs](Greater-London-boroughs-location-of-the-central-London-measurement-site-cross-and.png)

You can see that there are 33 areas: the 32 London boroughs plus the City of London, which is treated as a borough throughout this activity.
Next, you want to search online for the data you need.

Question: As a business analyst, you need to find data sets that contain the information that will help you make decisions. To keep things simple, you have decided to use only publicly available data sets as an initial step in deciding on a good location for your business. What sort of information do you think may be useful in helping you make this decision?

After careful thought, you decide that the following information would be helpful as an initial guide:
- Income, as it will be useful to see the relationship to number of takeaways.
- Population and population density, as these may also show a relationship to the number of takeaways.
- Number and density of takeaways in a borough, as these will allow you to assess the competition.

You find the following government websites and data sets:

- https://data.london.gov.uk/dataset/pubs-clubs-restaurants-takeaways-borough
- https://data.london.gov.uk/dataset/daytime-population-borough
- https://data.london.gov.uk/dataset/land-area-and-population-density-ward-and-borough
- https://www.data.gov.uk/dataset/5c4a083f-a8c6-42d8-ad40-36a9719a634c/household-income-estimates-for-small-areas/datafile/e55ef2c8-dbc1-43e0-8910-76ca27282009/preview


After some reformatting of the information in the files you downloaded, you set about starting your most important first step: exploratory data analysis (EDA).

However, before doing any graphical analysis, you need to look at the data to see what it contains and then see whether any cleaning of the data is needed. Cleaning data means filling in missing data, dropping irrelevant rows or columns that do not have useful data, or removing missing data. You may also have to rename items to create ease of manipulation and make things simple and consistent with other data sets.

You decide to start with the data file containing the data set on takeaway restaurants in the boroughs (***take-away-borough2.csv***).  The table below shows the first 10 rows of the data set.  To examine the entire file, download the file and open it using a text editor or Microsoft Excel.

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>PostCode</th>      <th>Borough</th>      <th>Number</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>E09000001</td>      <td>City of London</td>      <td>155</td>    </tr>    <tr>      <th>1</th>      <td>E09000002</td>      <td>Barking and Dagenham</td>      <td>135</td>    </tr>    <tr>      <th>2</th>      <td>E09000003</td>      <td>Barnet</td>      <td>180</td>    </tr>    <tr>      <th>3</th>      <td>E09000004</td>      <td>Bexley</td>      <td>125</td>    </tr>    <tr>      <th>4</th>      <td>E09000005</td>      <td>Brent</td>      <td>155</td>    </tr>    <tr>      <th>5</th>      <td>E09000006</td>      <td>Bromley</td>      <td>175</td>    </tr>    <tr>      <th>6</th>      <td>E09000007</td>      <td>Camden</td>      <td>225</td>    </tr>    <tr>      <th>7</th>      <td>E09000008</td>      <td>Croydon</td>      <td>245</td>    </tr>    <tr>      <th>8</th>      <td>E09000009</td>      <td>Ealing</td>      <td>160</td>    </tr>    <tr>      <th>9</th>      <td>E09000010</td>      <td>Enfield</td>      <td>165</td>    </tr>  </tbody></table>
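A first look like the one above can be sketched with pandas. This is a minimal illustration, not necessarily the code used to produce the table: the variable names are hypothetical, and a small in-memory sample (the first three rows shown above) stands in for the real file so that the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for take-away-borough2.csv: the first three rows from the table above.
# With the real file, pass its path to pd.read_csv instead of this StringIO object.
# The exact column layout of the real file is an assumption based on the table.
sample_csv = io.StringIO(
    "PostCode,Borough,Number\n"
    "E09000001,City of London,155\n"
    "E09000002,Barking and Dagenham,135\n"
    "E09000003,Barnet,180\n"
)

takeaways = pd.read_csv(sample_csv)

# head(10) returns up to the first 10 rows; here the sample only has three.
print(takeaways.head(10))

# shape gives (number of rows, number of columns).
print(takeaways.shape)
```

Looking at `head()` and `shape` before anything else is a quick way to confirm that the file loaded with the columns you expect.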

Next, you want to look at the daytime population of the boroughs. You decide on the daytime population as your target customers. The table below shows the first 10 rows of the data file ***daytime-population-borough3.csv***. To examine the entire file, download the file and open it using a text editor or Microsoft Excel.

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Area code</th>      <th>Area name</th>      <th>Male</th>      <th>Female</th>      <th>Total</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>E09000001</td>      <td>City of London</td>      <td>220,265</td>      <td>139,810</td>      <td>360,075</td>    </tr>    <tr>      <th>1</th>      <td>E09000002</td>      <td>Barking and Dagenham</td>      <td>84,724</td>      <td>84,393</td>      <td>169,117</td>    </tr>    <tr>      <th>2</th>      <td>E09000003</td>      <td>Barnet</td>      <td>147,455</td>      <td>167,037</td>      <td>314,492</td>    </tr>    <tr>      <th>3</th>      <td>E09000004</td>      <td>Bexley</td>      <td>95,155</td>      <td>101,364</td>      <td>196,519</td>    </tr>    <tr>      <th>4</th>      <td>E09000005</td>      <td>Brent</td>      <td>140,763</td>      <td>138,111</td>      <td>278,874</td>    </tr>    <tr>      <th>5</th>      <td>E09000006</td>      <td>Bromley</td>      <td>125,338</td>      <td>143,952</td>      <td>269,290</td>    </tr>    <tr>      <th>6</th>      <td>E09000007</td>      <td>Camden</td>      <td>192,443</td>      <td>191,664</td>      <td>384,107</td>    </tr>    <tr>      <th>7</th>      <td>E09000008</td>      <td>Croydon</td>      <td>148,975</td>      <td>161,666</td>      <td>310,641</td>    </tr>    <tr>      <th>8</th>      <td>E09000009</td>      <td>Ealing</td>      <td>154,820</td>      <td>151,187</td>      <td>306,007</td>    </tr>    <tr>      <th>9</th>      <td>E09000010</td>      <td>Enfield</td>      <td>134,437</td>      <td>145,787</td>      <td>280,224</td>    </tr>  </tbody></table>
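The population figures above are displayed with commas as thousands separators. If the file stores them that way (an assumption based on how the table is rendered), pandas would read them as text unless told otherwise. A minimal sketch of one way to handle this, using two sample rows from the table and hypothetical variable names:

```python
import io
import pandas as pd

# Stand-in for daytime-population-borough3.csv, assuming the numbers are
# stored as quoted strings with comma thousands separators.
sample_csv = io.StringIO(
    "Area code,Area name,Male,Female,Total\n"
    'E09000001,City of London,"220,265","139,810","360,075"\n'
    'E09000002,Barking and Dagenham,"84,724","84,393","169,117"\n'
)

# thousands="," tells read_csv to strip the separators and parse the
# columns as numbers rather than text.
daytime = pd.read_csv(sample_csv, thousands=",")

print(daytime.dtypes)

# Sanity check: Male + Female should equal Total in every row.
totals_match = (daytime["Male"] + daytime["Female"] == daytime["Total"]).all()
print(totals_match)
```

A consistency check such as comparing the sum of the parts with the total is a cheap way to spot parsing problems early.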

You decide that housing density (including population density) would be a useful measure for determining where to place your restaurant. The following table shows the first 10 rows of the file ***housing-density-borough.csv***, which contains population density. To examine the entire file, download the file and open it using a text editor or Microsoft Excel.

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Code</th>      <th>Name</th>      <th>Year</th>      <th>Source</th>      <th>Population</th>      <th>Inland_Area _Hectares</th>      <th>Total_Area_Hectares</th>      <th>Population_per_hectare</th>      <th>Square_Kilometres</th>      <th>Population_per_square_kilometre</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>E09000001</td>      <td>City of London</td>      <td>1999</td>      <td>ONS MYE</td>      <td>6581</td>      <td>290.4</td>      <td>314.9</td>      <td>22.7</td>      <td>2.9</td>      <td>2266.2</td>    </tr>    <tr>      <th>1</th>      <td>E09000001</td>      <td>City of London</td>      <td>2000</td>      <td>ONS MYE</td>      <td>7014</td>      <td>290.4</td>      <td>314.9</td>      <td>24.2</td>      <td>2.9</td>      <td>2415.3</td>    </tr>    <tr>      <th>2</th>      <td>E09000001</td>      <td>City of London</td>      <td>2001</td>      <td>ONS MYE</td>      <td>7359</td>      <td>290.4</td>      <td>314.9</td>      <td>25.3</td>      <td>2.9</td>      <td>2534.1</td>    </tr>    <tr>      <th>3</th>      <td>E09000001</td>      <td>City of London</td>      <td>2002</td>      <td>ONS MYE</td>      <td>7280</td>      <td>290.4</td>      <td>314.9</td>      <td>25.1</td>      <td>2.9</td>      <td>2506.9</td>    </tr>    <tr>      <th>4</th>      <td>E09000001</td>      <td>City of London</td>      <td>2003</td>      <td>ONS MYE</td>      <td>7115</td>      <td>290.4</td>      <td>314.9</td>      <td>24.5</td>      <td>2.9</td>      <td>2450.1</td>    </tr>    <tr>      <th>5</th>      <td>E09000001</td>      <td>City of London</td>      <td>2004</td>      <td>ONS MYE</td>      <td>7118</td>      <td>290.4</td>      <td>314.9</td>      <td>24.5</td>      <td>2.9</td>      <td>2451.2</td>    </tr>    <tr>      <th>6</th>      <td>E09000001</td>      <td>City of London</td>      <td>2005</td>      <td>ONS MYE</td>      
<td>7131</td>      <td>290.4</td>      <td>314.9</td>      <td>24.6</td>      <td>2.9</td>      <td>2455.6</td>    </tr>    <tr>      <th>7</th>      <td>E09000001</td>      <td>City of London</td>      <td>2006</td>      <td>ONS MYE</td>      <td>7254</td>      <td>290.4</td>      <td>314.9</td>      <td>25.0</td>      <td>2.9</td>      <td>2498.0</td>    </tr>    <tr>      <th>8</th>      <td>E09000001</td>      <td>City of London</td>      <td>2007</td>      <td>ONS MYE</td>      <td>7607</td>      <td>290.4</td>      <td>314.9</td>      <td>26.2</td>      <td>2.9</td>      <td>2619.5</td>    </tr>    <tr>      <th>9</th>      <td>E09000001</td>      <td>City of London</td>      <td>2008</td>      <td>ONS MYE</td>      <td>7429</td>      <td>290.4</td>      <td>314.9</td>      <td>25.6</td>      <td>2.9</td>      <td>2558.3</td>    </tr>  </tbody></table>

Question: There are a lot of columns in this dataframe. Which do you think you should keep and which should you drop?

From examining the table above and the file (***housing-density-borough.csv***) you notice that the data contains information for several years. You decide to filter the data set and only keep rows where the year is 2022. 

The ***Source*** column shows that the 2022 figures are population projections rather than measured estimates, so for now you only want the projections for the present year and, perhaps later, for future years. Later on, you can look for more recent measured population data for the boroughs to replace the projections.

The table below shows the first 10 rows for the year 2022:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Code</th>      <th>Name</th>      <th>Year</th>      <th>Source</th>      <th>Population</th>      <th>Inland_Area _Hectares</th>      <th>Total_Area_Hectares</th>      <th>Population_per_hectare</th>      <th>Square_Kilometres</th>      <th>Population_per_square_kilometre</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>E09000001</td>      <td>City of London</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>8289</td>      <td>290.4</td>      <td>314.9</td>      <td>28.5</td>      <td>2.9</td>      <td>2854.4</td>    </tr>    <tr>      <th>1</th>      <td>E09000002</td>      <td>Barking and Dagenham</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>224407</td>      <td>3610.8</td>      <td>3779.9</td>      <td>62.1</td>      <td>36.1</td>      <td>6214.9</td>    </tr>    <tr>      <th>2</th>      <td>E09000003</td>      <td>Barnet</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>415041</td>      <td>8674.8</td>      <td>8674.8</td>      <td>47.8</td>      <td>86.7</td>      <td>4784.4</td>    </tr>    <tr>      <th>3</th>      <td>E09000004</td>      <td>Bexley</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>258607</td>      <td>6058.1</td>      <td>6428.6</td>      <td>42.7</td>      <td>60.6</td>      <td>4268.8</td>    </tr>    <tr>      <th>4</th>      <td>E09000005</td>      <td>Brent</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>348783</td>      <td>4323.3</td>      <td>4323.3</td>      <td>80.7</td>      <td>43.2</td>      <td>8067.6</td>    </tr>    <tr>      <th>5</th>      <td>E09000006</td>      <td>Bromley</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>342548</td>      <td>15013.5</td>      <td>15013.5</td>      <td>22.8</td>      <td>150.1</td>      <td>2281.6</td>    </tr>    <tr> 
     <th>6</th>      <td>E09000007</td>      <td>Camden</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>261082</td>      <td>2178.9</td>      <td>2178.9</td>      <td>119.8</td>      <td>21.8</td>      <td>11982.1</td>    </tr>    <tr>      <th>7</th>      <td>E09000008</td>      <td>Croydon</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>406399</td>      <td>8650.4</td>      <td>8650.4</td>      <td>47.0</td>      <td>86.5</td>      <td>4698.1</td>    </tr>    <tr>      <th>8</th>      <td>E09000009</td>      <td>Ealing</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>377916</td>      <td>5554.4</td>      <td>5554.4</td>      <td>68.0</td>      <td>55.5</td>      <td>6803.9</td>    </tr>    <tr>      <th>9</th>      <td>E09000010</td>      <td>Enfield</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>349994</td>      <td>8083.2</td>      <td>8220.1</td>      <td>43.3</td>      <td>80.8</td>      <td>4329.9</td>    </tr>  </tbody></table>
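Filtering to a single year can be sketched with a boolean mask. The snippet below is a minimal illustration with hypothetical variable names; it uses three rows taken from the tables above and only a subset of the columns.

```python
import pandas as pd

# Three sample rows from housing-density-borough.csv (subset of columns).
density = pd.DataFrame({
    "Code": ["E09000001", "E09000001", "E09000002"],
    "Name": ["City of London", "City of London", "Barking and Dagenham"],
    "Year": [2008, 2022, 2022],
    "Population": [7429, 8289, 224407],
})

# The comparison produces a True/False value per row; indexing with it
# keeps only the rows where the condition is True.
density_2022 = density[density["Year"] == 2022].reset_index(drop=True)

print(density_2022)
```

`reset_index(drop=True)` renumbers the remaining rows from 0, which matches the row labels in the 2022 table.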

Now you want to look closer at this data set to see whether there is any data missing in the columns and what type of data you are dealing with. The following table shows a summary of all the columns in the ***housing-density-borough.csv*** data set for 2022.  The column ***Non-Null Count*** shows how many rows have values (non-null).  The column ***Data Type*** shows the basic data types you learned in Python: string, int and float:

<table border="1" class="dataframe"><thead><th>Column</th><th>Non-Null Count</th><th>Data Type</th></thead>
<tr><td>Code</td><td>36 non-null</td><td>string</td></tr>
<tr><td>Name</td><td>36 non-null</td><td>string</td></tr>
<tr><td>Year</td><td>36 non-null</td><td>int</td></tr>
<tr><td>Source</td><td>36 non-null</td><td>string</td></tr>
<tr><td>Population</td><td>36 non-null</td><td>int</td></tr>
<tr><td>Inland_Area _Hectares</td><td>36 non-null</td><td>float</td></tr>
<tr><td>Total_Area_Hectares</td><td>36 non-null</td><td>float</td></tr>
<tr><td>Population_per_hectare</td><td>36 non-null</td><td>float</td></tr>
<tr><td>Square_Kilometres</td><td>36 non-null</td><td>float</td></tr>
<tr><td>Population_per_square_kilometre</td><td>36 non-null</td><td>float</td></tr>
</table>
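A summary like the one above can be built in a few lines. This is a sketch under assumptions: the variable names are hypothetical, the sample has only two rows and five of the columns, and pandas reports its own type names (for example `int64` and `float64`) rather than the simplified names in the table.

```python
import pandas as pd

# Two sample rows from the 2022 data (subset of columns).
density_2022 = pd.DataFrame({
    "Code": ["E09000001", "E09000002"],
    "Name": ["City of London", "Barking and Dagenham"],
    "Year": [2022, 2022],
    "Population": [8289, 224407],
    "Population_per_hectare": [28.5, 62.1],
})

# notna().sum() counts the non-missing values in each column;
# dtypes gives the data type pandas assigned to each column.
summary = pd.DataFrame({
    "Non-Null Count": density_2022.notna().sum(),
    "Data Type": density_2022.dtypes.astype(str),
})
print(summary)

# Total number of missing values across the whole data frame.
total_missing = int(density_2022.isna().sum().sum())
print(total_missing)
```

`density_2022.info()` prints similar information directly; building the summary as a data frame simply makes it easier to inspect or export.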

The table below contains the last 10 rows of the 2022 ***housing-density-borough.csv***. Notice that the last three rows are not boroughs so they need to be dropped from your analysis.

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Code</th>      <th>Name</th>      <th>Year</th>      <th>Source</th>      <th>Population</th>      <th>Inland_Area _Hectares</th>      <th>Total_Area_Hectares</th>      <th>Population_per_hectare</th>      <th>Square_Kilometres</th>      <th>Population_per_square_kilometre</th>    </tr>  </thead>  <tbody>    <tr>      <th>26</th>      <td>E09000027</td>      <td>Richmond upon Thames</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>205051</td>      <td>5740.7</td>      <td>5876.1</td>      <td>35.7</td>      <td>57.4</td>      <td>3571.9</td>    </tr>    <tr>      <th>27</th>      <td>E09000028</td>      <td>Southwark</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>334826</td>      <td>2886.2</td>      <td>2991.3</td>      <td>116.0</td>      <td>28.9</td>      <td>11600.9</td>    </tr>    <tr>      <th>28</th>      <td>E09000029</td>      <td>Sutton</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>215228</td>      <td>4384.7</td>      <td>4384.7</td>      <td>49.1</td>      <td>43.8</td>      <td>4908.6</td>    </tr>    <tr>      <th>29</th>      <td>E09000030</td>      <td>Tower Hamlets</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>334834</td>      <td>1978.1</td>      <td>2157.9</td>      <td>169.3</td>      <td>19.8</td>      <td>16926.8</td>    </tr>    <tr>      <th>30</th>      <td>E09000031</td>      <td>Waltham Forest</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>295311</td>      <td>3880.8</td>      <td>3880.8</td>      <td>76.1</td>      <td>38.8</td>      <td>7609.5</td>    </tr>    <tr>      <th>31</th>      <td>E09000032</td>      <td>Wandsworth</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>343147</td>      <td>3426.4</td>      <td>3522.0</td>      <td>100.1</td>      <td>34.3</td>      
<td>10014.7</td>    </tr>    <tr>      <th>32</th>      <td>E09000033</td>      <td>Westminster</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>264875</td>      <td>2148.7</td>      <td>2203.0</td>      <td>123.3</td>      <td>21.5</td>      <td>12327.2</td>    </tr>    <tr>      <th>33</th>      <td>E12000007</td>      <td>Greater London</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>9390069</td>      <td>157214.7</td>      <td>159470.6</td>      <td>59.7</td>      <td>1572.1</td>      <td>5972.8</td>    </tr>    <tr>      <th>34</th>      <td>E13000001</td>      <td>Inner London</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>3781965</td>      <td>31929.2</td>      <td>32796.3</td>      <td>118.4</td>      <td>319.3</td>      <td>11844.8</td>    </tr>    <tr>      <th>35</th>      <td>E13000002</td>      <td>Outer London</td>      <td>2022</td>      <td>GLA Population Projections</td>      <td>5608104</td>      <td>125423.6</td>      <td>126675.6</td>      <td>44.7</td>      <td>1254.2</td>      <td>4471.3</td>    </tr>  </tbody></table>
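One way to drop the non-borough rows is to filter on the area code. In the tables above, every borough code starts with `E09`, while the three aggregate rows start with `E12` or `E13`; treating that prefix as the rule is an assumption drawn from those tables. A minimal sketch with hypothetical variable names:

```python
import pandas as pd

# The last four rows of the 2022 data (subset of columns).
density_2022 = pd.DataFrame({
    "Code": ["E09000033", "E12000007", "E13000001", "E13000002"],
    "Name": ["Westminster", "Greater London", "Inner London", "Outer London"],
    "Population": [264875, 9390069, 3781965, 5608104],
})

# Keep only rows whose code starts with "E09" (assumed to mark a borough).
is_borough = density_2022["Code"].str.startswith("E09")
boroughs_only = density_2022[is_borough].reset_index(drop=True)

print(boroughs_only)
```

Filtering on the code rather than on row position means the step still works if the rows arrive in a different order.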

Everything looks good, so now you decide to drop some columns. What are some columns that may not be required for your analysis? You decide to drop columns ***Source***, ***Inland_Area _Hectares***, ***Total_Area_Hectares*** and ***Square_Kilometres***.  This is how the first 10 rows of the new data set now look:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Code</th>      <th>Name</th>      <th>Year</th>      <th>Population</th>      <th>Population_per_hectare</th>      <th>Population_per_square_kilometre</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>E09000001</td>      <td>City of London</td>      <td>2022</td>      <td>8289</td>      <td>28.5</td>      <td>2854.4</td>    </tr>    <tr>      <th>1</th>      <td>E09000002</td>      <td>Barking and Dagenham</td>      <td>2022</td>      <td>224407</td>      <td>62.1</td>      <td>6214.9</td>    </tr>    <tr>      <th>2</th>      <td>E09000003</td>      <td>Barnet</td>      <td>2022</td>      <td>415041</td>      <td>47.8</td>      <td>4784.4</td>    </tr>    <tr>      <th>3</th>      <td>E09000004</td>      <td>Bexley</td>      <td>2022</td>      <td>258607</td>      <td>42.7</td>      <td>4268.8</td>    </tr>    <tr>      <th>4</th>      <td>E09000005</td>      <td>Brent</td>      <td>2022</td>      <td>348783</td>      <td>80.7</td>      <td>8067.6</td>    </tr>    <tr>      <th>5</th>      <td>E09000006</td>      <td>Bromley</td>      <td>2022</td>      <td>342548</td>      <td>22.8</td>      <td>2281.6</td>    </tr>    <tr>      <th>6</th>      <td>E09000007</td>      <td>Camden</td>      <td>2022</td>      <td>261082</td>      <td>119.8</td>      <td>11982.1</td>    </tr>    <tr>      <th>7</th>      <td>E09000008</td>      <td>Croydon</td>      <td>2022</td>      <td>406399</td>      <td>47.0</td>      <td>4698.1</td>    </tr>    <tr>      <th>8</th>      <td>E09000009</td>      <td>Ealing</td>      <td>2022</td>      <td>377916</td>      <td>68.0</td>      <td>6803.9</td>    </tr>    <tr>      <th>9</th>      <td>E09000010</td>      <td>Enfield</td>      <td>2022</td>      <td>349994</td>      <td>43.3</td>      <td>4329.9</td>    </tr>  </tbody></table>
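Dropping the four columns can be sketched as follows, using one sample row and hypothetical variable names. Note that the column name `Inland_Area _Hectares` contains a space, exactly as it appears in the table header, so it has to be typed that way.

```python
import pandas as pd

# One sample row with all ten original columns.
density_2022 = pd.DataFrame({
    "Code": ["E09000001"],
    "Name": ["City of London"],
    "Year": [2022],
    "Source": ["GLA Population Projections"],
    "Population": [8289],
    "Inland_Area _Hectares": [290.4],
    "Total_Area_Hectares": [314.9],
    "Population_per_hectare": [28.5],
    "Square_Kilometres": [2.9],
    "Population_per_square_kilometre": [2854.4],
})

columns_to_drop = [
    "Source",
    "Inland_Area _Hectares",  # the space is part of the column name
    "Total_Area_Hectares",
    "Square_Kilometres",
]

# drop returns a new data frame and leaves the original unchanged.
density_slim = density_2022.drop(columns=columns_to_drop)

print(list(density_slim.columns))
```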

You now decide to check the household income in each borough, which is contained in the file ***modelled-household-income-estimates-borough.csv***. The table below displays the first 10 rows of the file. To examine the entire file, download the file and open it using a text editor or Microsoft Excel.

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Code</th>      <th>Borough</th>      <th>Measure</th>      <th>Year</th>      <th>Income</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>E09000001</td>      <td>City of London</td>      <td>Mean</td>      <td>2001/02</td>      <td>65120</td>    </tr>    <tr>      <th>1</th>      <td>E09000002</td>      <td>Barking and Dagenham</td>      <td>Mean</td>      <td>2001/02</td>      <td>22930</td>    </tr>    <tr>      <th>2</th>      <td>E09000003</td>      <td>Barnet</td>      <td>Mean</td>      <td>2001/02</td>      <td>39190</td>    </tr>    <tr>      <th>3</th>      <td>E09000004</td>      <td>Bexley</td>      <td>Mean</td>      <td>2001/02</td>      <td>30060</td>    </tr>    <tr>      <th>4</th>      <td>E09000005</td>      <td>Brent</td>      <td>Mean</td>      <td>2001/02</td>      <td>29430</td>    </tr>    <tr>      <th>5</th>      <td>E09000006</td>      <td>Bromley</td>      <td>Mean</td>      <td>2001/02</td>      <td>37110</td>    </tr>    <tr>      <th>6</th>      <td>E09000007</td>      <td>Camden</td>      <td>Mean</td>      <td>2001/02</td>      <td>47940</td>    </tr>    <tr>      <th>7</th>      <td>E09000008</td>      <td>Croydon</td>      <td>Mean</td>      <td>2001/02</td>      <td>31740</td>    </tr>    <tr>      <th>8</th>      <td>E09000009</td>      <td>Ealing</td>      <td>Mean</td>      <td>2001/02</td>      <td>34000</td>    </tr>    <tr>      <th>9</th>      <td>E09000010</td>      <td>Enfield</td>      <td>Mean</td>      <td>2001/02</td>      <td>30750</td>    </tr>  </tbody></table>

Question: What problems do you see with this data set?

The first thing you notice is that the data only goes up to 2012/13. You decide to use this year for now, but you may want to look for newer data later on. You go ahead and extract the data for this year into a temporary new data frame.

Question: Why are you content with just using the latest year, for now, rather than finding data for 2022?

You decide to keep only the rows for 2012/13 where the ***Measure*** column is equal to ***Mean***. After filtering the data, you also drop the columns ***Measure*** and ***Year***, since they are not needed for now. You also notice rows containing non-borough information, which need to be removed. The table below shows the first 10 rows of the new data set:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Code</th>      <th>Borough</th>      <th>Income</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>E09000001</td>      <td>City of London</td>      <td>99390</td>    </tr>    <tr>      <th>1</th>      <td>E09000002</td>      <td>Barking and Dagenham</td>      <td>34080</td>    </tr>    <tr>      <th>2</th>      <td>E09000003</td>      <td>Barnet</td>      <td>54530</td>    </tr>    <tr>      <th>3</th>      <td>E09000004</td>      <td>Bexley</td>      <td>44430</td>    </tr>    <tr>      <th>4</th>      <td>E09000005</td>      <td>Brent</td>      <td>39630</td>    </tr>    <tr>      <th>5</th>      <td>E09000006</td>      <td>Bromley</td>      <td>55140</td>    </tr>    <tr>      <th>6</th>      <td>E09000007</td>      <td>Camden</td>      <td>67990</td>    </tr>    <tr>      <th>7</th>      <td>E09000008</td>      <td>Croydon</td>      <td>45120</td>    </tr>    <tr>      <th>8</th>      <td>E09000009</td>      <td>Ealing</td>      <td>45690</td>    </tr>    <tr>      <th>9</th>      <td>E09000010</td>      <td>Enfield</td>      <td>41250</td>    </tr>  </tbody></table>
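The filtering described above combines two conditions. The sketch below uses hypothetical variable names and four sample rows: three carry values from the tables above, and one `Median` row is a made-up placeholder included only to show the filter removing it.

```python
import pandas as pd

income = pd.DataFrame({
    "Code": ["E09000001", "E09000001", "E09000001", "E09000002"],
    "Borough": ["City of London", "City of London",
                "City of London", "Barking and Dagenham"],
    "Measure": ["Mean", "Mean", "Median", "Mean"],
    "Year": ["2001/02", "2012/13", "2012/13", "2012/13"],
    # The 0 in the Median row is a placeholder, not a real figure.
    "Income": [65120, 99390, 0, 34080],
})

# Combine the two conditions with & (each wrapped in parentheses).
mask = (income["Year"] == "2012/13") & (income["Measure"] == "Mean")

# Keep the matching rows, then drop the columns that are now constant.
income_latest = (
    income.loc[mask]
    .drop(columns=["Measure", "Year"])
    .reset_index(drop=True)
)

print(income_latest)
```

The ***Year*** values are text such as `2012/13`, not numbers, so the comparison has to use a string.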

You now have population, population density, income, number of takeaways and some other basic information in the data frames. It is now worth doing some initial exploratory data analysis on the data you have collected and reformatted. First, you need to decide which columns to take from each data set to build a preliminary combined data set for the analysis. Viewing the first rows of every data set together will help you decide.

Question: Which columns will you use from each of the data frames that will be helpful in making your decision about location?

After checking the data frames, you combine the columns you want from each separate data set into one data set. The table below shows the combined data set with six columns: ***PostCode***, ***Borough***, ***Number***, ***Density***, ***Income*** and ***Population***:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>PostCode</th>      <th>Borough</th>      <th>Number</th>      <th>Density</th>      <th>Income</th>      <th>Population</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>E09000001</td>      <td>City of London</td>      <td>155</td>      <td>2854.4</td>      <td>99390</td>      <td>8289</td>    </tr>    <tr>      <th>1</th>      <td>E09000002</td>      <td>Barking and Dagenham</td>      <td>135</td>      <td>6214.9</td>      <td>34080</td>      <td>224407</td>    </tr>    <tr>      <th>2</th>      <td>E09000003</td>      <td>Barnet</td>      <td>180</td>      <td>4784.4</td>      <td>54530</td>      <td>415041</td>    </tr>    <tr>      <th>3</th>      <td>E09000004</td>      <td>Bexley</td>      <td>125</td>      <td>4268.8</td>      <td>44430</td>      <td>258607</td>    </tr>    <tr>      <th>4</th>      <td>E09000005</td>      <td>Brent</td>      <td>155</td>      <td>8067.6</td>      <td>39630</td>      <td>348783</td>    </tr>    <tr>      <th>5</th>      <td>E09000006</td>      <td>Bromley</td>      <td>175</td>      <td>2281.6</td>      <td>55140</td>      <td>342548</td>    </tr>    <tr>      <th>6</th>      <td>E09000007</td>      <td>Camden</td>      <td>225</td>      <td>11982.1</td>      <td>67990</td>      <td>261082</td>    </tr>    <tr>      <th>7</th>      <td>E09000008</td>      <td>Croydon</td>      <td>245</td>      <td>4698.1</td>      <td>45120</td>      <td>406399</td>    </tr>    <tr>      <th>8</th>      <td>E09000009</td>      <td>Ealing</td>      <td>160</td>      <td>6803.9</td>      <td>45690</td>      <td>377916</td>    </tr>    <tr>      <th>9</th>      <td>E09000010</td>      <td>Enfield</td>      <td>165</td>      <td>4329.9</td>      <td>41250</td>      <td>349994</td>    </tr>    <tr>      <th>10</th>      <td>E09000011</td>      <td>Greenwich</td>      <td>140</td>      <td>6275.4</td>      <td>44370</td>      
<td>297039</td>    </tr>    <tr>      <th>11</th>      <td>E09000012</td>      <td>Hackney</td>      <td>175</td>      <td>15450.2</td>      <td>42690</td>      <td>294312</td>    </tr>    <tr>      <th>12</th>      <td>E09000013</td>      <td>Hammersmith and Fulham</td>      <td>115</td>      <td>12282.8</td>      <td>62910</td>      <td>201406</td>    </tr>    <tr>      <th>13</th>      <td>E09000014</td>      <td>Haringey</td>      <td>130</td>      <td>9916.2</td>      <td>45860</td>      <td>293503</td>    </tr>    <tr>      <th>14</th>      <td>E09000015</td>      <td>Harrow</td>      <td>120</td>      <td>5260.3</td>      <td>49060</td>      <td>265449</td>    </tr>    <tr>      <th>15</th>      <td>E09000016</td>      <td>Havering</td>      <td>155</td>      <td>2415.4</td>      <td>44430</td>      <td>271368</td>    </tr>    <tr>      <th>16</th>      <td>E09000017</td>      <td>Hillingdon</td>      <td>185</td>      <td>2787.3</td>      <td>44950</td>      <td>322492</td>    </tr>    <tr>      <th>17</th>      <td>E09000018</td>      <td>Hounslow</td>      <td>135</td>      <td>5169.1</td>      <td>44490</td>      <td>289358</td>    </tr>    <tr>      <th>18</th>      <td>E09000019</td>      <td>Islington</td>      <td>185</td>      <td>16547.4</td>      <td>54950</td>      <td>245839</td>    </tr>    <tr>      <th>19</th>      <td>E09000020</td>      <td>Kensington and Chelsea</td>      <td>80</td>      <td>13398.7</td>      <td>116350</td>      <td>162446</td>    </tr>    <tr>      <th>20</th>      <td>E09000021</td>      <td>Kingston upon Thames</td>      <td>100</td>      <td>5003.4</td>      <td>56920</td>      <td>186434</td>    </tr>    <tr>      <th>21</th>      <td>E09000022</td>      <td>Lambeth</td>      <td>175</td>      <td>12830.3</td>      <td>48610</td>      <td>343982</td>    </tr>    <tr>      <th>22</th>      <td>E09000023</td>      <td>Lewisham</td>      <td>175</td>      <td>9201.4</td>      <td>43360</td>      <td>323421</td>    
</tr>    <tr>      <th>23</th>      <td>E09000024</td>      <td>Merton</td>      <td>110</td>      <td>5758.5</td>      <td>57160</td>      <td>216662</td>    </tr>    <tr>      <th>24</th>      <td>E09000025</td>      <td>Newham</td>      <td>185</td>      <td>10221.6</td>      <td>34260</td>      <td>370004</td>    </tr>    <tr>      <th>25</th>      <td>E09000026</td>      <td>Redbridge</td>      <td>165</td>      <td>5672.2</td>      <td>45380</td>      <td>320018</td>    </tr>    <tr>      <th>26</th>      <td>E09000027</td>      <td>Richmond upon Thames</td>      <td>85</td>      <td>3571.9</td>      <td>76610</td>      <td>205051</td>    </tr>    <tr>      <th>27</th>      <td>E09000028</td>      <td>Southwark</td>      <td>200</td>      <td>11600.9</td>      <td>48000</td>      <td>334826</td>    </tr>    <tr>      <th>28</th>      <td>E09000029</td>      <td>Sutton</td>      <td>120</td>      <td>4908.6</td>      <td>49170</td>      <td>215228</td>    </tr>    <tr>      <th>29</th>      <td>E09000030</td>      <td>Tower Hamlets</td>      <td>220</td>      <td>16926.8</td>      <td>45720</td>      <td>334834</td>    </tr>    <tr>      <th>30</th>      <td>E09000031</td>      <td>Waltham Forest</td>      <td>155</td>      <td>7609.5</td>      <td>39460</td>      <td>295311</td>    </tr>    <tr>      <th>31</th>      <td>E09000032</td>      <td>Wandsworth</td>      <td>165</td>      <td>10014.7</td>      <td>66220</td>      <td>343147</td>    </tr>    <tr>      <th>32</th>      <td>E09000033</td>      <td>Westminster</td>      <td>305</td>      <td>12327.2</td>      <td>80760</td>      <td>264875</td>    </tr>  </tbody></table>
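Combining the data sets can be sketched as a merge on the area code, which appears in every file (as ***PostCode*** in the takeaway data and ***Code*** in the others). This is one possible approach with hypothetical variable names, not necessarily how the table above was built; the sample uses two boroughs from the tables.

```python
import pandas as pd

takeaways = pd.DataFrame({
    "PostCode": ["E09000001", "E09000002"],
    "Borough": ["City of London", "Barking and Dagenham"],
    "Number": [155, 135],
})
density = pd.DataFrame({
    "Code": ["E09000001", "E09000002"],
    "Population": [8289, 224407],
    "Population_per_square_kilometre": [2854.4, 6214.9],
})
income = pd.DataFrame({
    "Code": ["E09000001", "E09000002"],
    "Income": [99390, 34080],
})

# Rename so that every data frame shares the same key column name.
density = density.rename(columns={
    "Code": "PostCode",
    "Population_per_square_kilometre": "Density",
})
income = income.rename(columns={"Code": "PostCode"})

# An inner merge keeps only the codes present in every data frame.
combined = (
    takeaways
    .merge(density, on="PostCode", how="inner")
    .merge(income, on="PostCode", how="inner")
)

# Put the columns in the same order as the table above.
combined = combined[
    ["PostCode", "Borough", "Number", "Density", "Income", "Population"]
]
print(combined)
```

Merging on the code rather than on the borough name avoids mismatches caused by small differences in spelling between sources.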

You are now at a point where you want to do some feature engineering. You have the basic data, and you can now do some maths to create new columns. For example, you may want to see how many takeaways there are per person. This is a very small decimal number, so it is expressed here as a percentage (the number of takeaways per 100 people). Alternatively, you could calculate the number of people per takeaway, but let's see how the first approach works. Let's call this new column NumPerPerson (the number of takeaways per person in the borough, as a percentage).
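
As a rough sketch of this feature engineering step, the snippet below rebuilds the new column for four boroughs copied from the table. It assumes the data frame is called `df_stats` and that the ratio is multiplied by 100 to give a percentage, which is consistent with the values shown in the table but is not confirmed by the notebook's own code.

```python
import pandas as pd

# Four rows copied from the table; the full data frame has 33 boroughs.
df_stats = pd.DataFrame({
    "Borough": ["City of London", "Barking and Dagenham", "Barnet", "Westminster"],
    "Number": [155, 135, 180, 305],
    "Population": [8289, 224407, 415041, 264875],
})

# Takeaways per person, scaled by 100 so that it reads as a percentage
# (assumption: this scaling matches how the NumPerPerson column was built).
df_stats["NumPerPerson"] = df_stats["Number"] / df_stats["Population"] * 100

print(df_stats.round(6))
```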

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>PostCode</th>      <th>Borough</th>      <th>Number</th>      <th>Density</th>      <th>Income</th>      <th>Population</th>      <th>NumPerPerson</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>E09000001</td>      <td>City of London</td>      <td>155</td>      <td>2854.4</td>      <td>99390</td>      <td>8289</td>      <td>1.869948</td>    </tr>    <tr>      <th>1</th>      <td>E09000002</td>      <td>Barking and Dagenham</td>      <td>135</td>      <td>6214.9</td>      <td>34080</td>      <td>224407</td>      <td>0.060159</td>    </tr>    <tr>      <th>2</th>      <td>E09000003</td>      <td>Barnet</td>      <td>180</td>      <td>4784.4</td>      <td>54530</td>      <td>415041</td>      <td>0.043369</td>    </tr>    <tr>      <th>3</th>      <td>E09000004</td>      <td>Bexley</td>      <td>125</td>      <td>4268.8</td>      <td>44430</td>      <td>258607</td>      <td>0.048336</td>    </tr>    <tr>      <th>4</th>      <td>E09000005</td>      <td>Brent</td>      <td>155</td>      <td>8067.6</td>      <td>39630</td>      <td>348783</td>      <td>0.044440</td>    </tr>    <tr>      <th>5</th>      <td>E09000006</td>      <td>Bromley</td>      <td>175</td>      <td>2281.6</td>      <td>55140</td>      <td>342548</td>      <td>0.051088</td>    </tr>    <tr>      <th>6</th>      <td>E09000007</td>      <td>Camden</td>      <td>225</td>      <td>11982.1</td>      <td>67990</td>      <td>261082</td>      <td>0.086180</td>    </tr>    <tr>      <th>7</th>      <td>E09000008</td>      <td>Croydon</td>      <td>245</td>      <td>4698.1</td>      <td>45120</td>      <td>406399</td>      <td>0.060286</td>    </tr>    <tr>      <th>8</th>      <td>E09000009</td>      <td>Ealing</td>      <td>160</td>      <td>6803.9</td>      <td>45690</td>      <td>377916</td>      <td>0.042337</td>    </tr>    <tr>      <th>9</th>      <td>E09000010</td>      
<td>Enfield</td>      <td>165</td>      <td>4329.9</td>      <td>41250</td>      <td>349994</td>      <td>0.047144</td>    </tr>    <tr>      <th>10</th>      <td>E09000011</td>      <td>Greenwich</td>      <td>140</td>      <td>6275.4</td>      <td>44370</td>      <td>297039</td>      <td>0.047132</td>    </tr>    <tr>      <th>11</th>      <td>E09000012</td>      <td>Hackney</td>      <td>175</td>      <td>15450.2</td>      <td>42690</td>      <td>294312</td>      <td>0.059461</td>    </tr>    <tr>      <th>12</th>      <td>E09000013</td>      <td>Hammersmith and Fulham</td>      <td>115</td>      <td>12282.8</td>      <td>62910</td>      <td>201406</td>      <td>0.057099</td>    </tr>    <tr>      <th>13</th>      <td>E09000014</td>      <td>Haringey</td>      <td>130</td>      <td>9916.2</td>      <td>45860</td>      <td>293503</td>      <td>0.044293</td>    </tr>    <tr>      <th>14</th>      <td>E09000015</td>      <td>Harrow</td>      <td>120</td>      <td>5260.3</td>      <td>49060</td>      <td>265449</td>      <td>0.045206</td>    </tr>    <tr>      <th>15</th>      <td>E09000016</td>      <td>Havering</td>      <td>155</td>      <td>2415.4</td>      <td>44430</td>      <td>271368</td>      <td>0.057118</td>    </tr>    <tr>      <th>16</th>      <td>E09000017</td>      <td>Hillingdon</td>      <td>185</td>      <td>2787.3</td>      <td>44950</td>      <td>322492</td>      <td>0.057366</td>    </tr>    <tr>      <th>17</th>      <td>E09000018</td>      <td>Hounslow</td>      <td>135</td>      <td>5169.1</td>      <td>44490</td>      <td>289358</td>      <td>0.046655</td>    </tr>    <tr>      <th>18</th>      <td>E09000019</td>      <td>Islington</td>      <td>185</td>      <td>16547.4</td>      <td>54950</td>      <td>245839</td>      <td>0.075253</td>    </tr>    <tr>      <th>19</th>      <td>E09000020</td>      <td>Kensington and Chelsea</td>      <td>80</td>      <td>13398.7</td>      <td>116350</td>      <td>162446</td>      <td>0.049247</td>    
</tr>    <tr>      <th>20</th>      <td>E09000021</td>      <td>Kingston upon Thames</td>      <td>100</td>      <td>5003.4</td>      <td>56920</td>      <td>186434</td>      <td>0.053638</td>    </tr>    <tr>      <th>21</th>      <td>E09000022</td>      <td>Lambeth</td>      <td>175</td>      <td>12830.3</td>      <td>48610</td>      <td>343982</td>      <td>0.050875</td>    </tr>    <tr>      <th>22</th>      <td>E09000023</td>      <td>Lewisham</td>      <td>175</td>      <td>9201.4</td>      <td>43360</td>      <td>323421</td>      <td>0.054109</td>    </tr>    <tr>      <th>23</th>      <td>E09000024</td>      <td>Merton</td>      <td>110</td>      <td>5758.5</td>      <td>57160</td>      <td>216662</td>      <td>0.050770</td>    </tr>    <tr>      <th>24</th>      <td>E09000025</td>      <td>Newham</td>      <td>185</td>      <td>10221.6</td>      <td>34260</td>      <td>370004</td>      <td>0.049999</td>    </tr>    <tr>      <th>25</th>      <td>E09000026</td>      <td>Redbridge</td>      <td>165</td>      <td>5672.2</td>      <td>45380</td>      <td>320018</td>      <td>0.051560</td>    </tr>    <tr>      <th>26</th>      <td>E09000027</td>      <td>Richmond upon Thames</td>      <td>85</td>      <td>3571.9</td>      <td>76610</td>      <td>205051</td>      <td>0.041453</td>    </tr>    <tr>      <th>27</th>      <td>E09000028</td>      <td>Southwark</td>      <td>200</td>      <td>11600.9</td>      <td>48000</td>      <td>334826</td>      <td>0.059733</td>    </tr>    <tr>      <th>28</th>      <td>E09000029</td>      <td>Sutton</td>      <td>120</td>      <td>4908.6</td>      <td>49170</td>      <td>215228</td>      <td>0.055755</td>    </tr>    <tr>      <th>29</th>      <td>E09000030</td>      <td>Tower Hamlets</td>      <td>220</td>      <td>16926.8</td>      <td>45720</td>      <td>334834</td>      <td>0.065704</td>    </tr>    <tr>      <th>30</th>      <td>E09000031</td>      <td>Waltham Forest</td>      <td>155</td>      <td>7609.5</td>      
<td>39460</td>      <td>295311</td>      <td>0.052487</td>    </tr>    <tr>      <th>31</th>      <td>E09000032</td>      <td>Wandsworth</td>      <td>165</td>      <td>10014.7</td>      <td>66220</td>      <td>343147</td>      <td>0.048084</td>    </tr>    <tr>      <th>32</th>      <td>E09000033</td>      <td>Westminster</td>      <td>305</td>      <td>12327.2</td>      <td>80760</td>      <td>264875</td>      <td>0.115149</td>    </tr>  </tbody></table>

You are now ready to visualise the data. EDA is a very important aspect of data science, and you will spend some time looking at the data and trying to extract useful information and inferences from it.

In finding a location for your takeaway, you must consider many factors. So far, you have information for all of the boroughs in terms of population, population density, mean income, number of takeaways and number of takeaways per person. Now you have to ask yourself, ‘How can I use this information to find the best place to locate my business?’

Question: Can you explain how each of these factors can help you find the best location? Try to think in terms of each factor by itself and then how you might use combinations of the factors to find more useful information.

Question: You only have limited information at the moment, and you can of course find much more information from other data sets, particularly some commercial data sets. What other factors would you ideally use in finding the best location for your business? Why these factors? What are the limitations of your present approach to finding the best location so far?

After careful consideration, you realise that you want to see how population and income are related to number of takeaways in a borough. You decide to plot these columns one next to the other for comparison.

The visualisation below shows how Population and Income vary with the Number of takeaways in a borough:

![London Boroughs](Population_Income.png)

You notice that population tends to increase with the number of takeaways in a borough. However, the average income of people in a borough seems to stay between £40,000 and £60,000 regardless of the number of takeaways. Knowing that each data point represents a borough, you can refer back to the data frame to see which borough each data point represents.
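
A side-by-side pair of scatter plots like this could be produced along the following lines. This is only a sketch using plain matplotlib and six boroughs copied from the table; the variable names, figure size and labels are illustrative choices, not the notebook's actual plotting code.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Six boroughs copied from the table (the full data frame has 33 rows).
df_stats = pd.DataFrame({
    "Borough": ["Barnet", "Camden", "Croydon", "Hackney", "Sutton", "Westminster"],
    "Number": [180, 225, 245, 175, 120, 305],
    "Income": [54530, 67990, 45120, 42690, 49170, 80760],
    "Population": [415041, 261082, 406399, 294312, 215228, 264875],
})

# Two panels that share the x-axis (number of takeaways) for easy comparison.
fig, (ax_pop, ax_inc) = plt.subplots(1, 2, figsize=(10, 4), sharex=True)

ax_pop.scatter(df_stats["Number"], df_stats["Population"])
ax_pop.set_xlabel("Number of takeaways")
ax_pop.set_ylabel("Population")

ax_inc.scatter(df_stats["Number"], df_stats["Income"])
ax_inc.set_xlabel("Number of takeaways")
ax_inc.set_ylabel("Mean income (£)")

fig.tight_layout()
```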

Question: What conclusions do you draw from the two scatter plots above? How might these ‘trends’ be helpful in deciding on location?

Next you decide to look at how income changes with postcode, but you also want to see how this relationship varies with the number of takeaways in a borough. You decide to use a bubble diagram where the sizes of the bubbles are related to the number of takeaways in a borough.

![London Boroughs](PostCodeVIncomeWithNumberOfTakeAways.png)

You can see some trends in the data visualisation, but you also see some outliers (data points that seem to go against the trend). The sizes of the bubbles also give you some information.

### Questions:

- What can you conclude from this bubble plot?
- What, if any, trends do you observe?
- What outliers do you see? Why do you think these boroughs are outliers?
- Looking at this bubble diagram, where would you say would be a good place to put your business?
- Has this plot been helpful in making a decision on location?
- After looking at this diagram, what other information do you think may be helpful?

You decide that you would like to make this plot better by adding the borough names, so you produce the bubble chart below:

![London Boroughs](IncomevPostcodeWithBubbleSizeAsNumber.png)
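
One possible way to draw a labelled bubble chart of this kind is sketched below with plain matplotlib. The bubble scaling factor, colours and label placement are arbitrary assumptions made for illustration, and only five boroughs from the table are included.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Five boroughs copied from the table (the full data frame has 33 rows).
df_stats = pd.DataFrame({
    "PostCode": ["E09000003", "E09000007", "E09000008", "E09000020", "E09000033"],
    "Borough": ["Barnet", "Camden", "Croydon", "Kensington and Chelsea", "Westminster"],
    "Number": [180, 225, 245, 80, 305],
    "Income": [54530, 67990, 45120, 116350, 80760],
})

# Use integer positions on the x-axis and show the postcodes as tick labels.
positions = list(range(len(df_stats)))

fig, ax = plt.subplots(figsize=(8, 4))
# Bubble area is proportional to the number of takeaways (the factor 3 is arbitrary).
ax.scatter(positions, df_stats["Income"], s=df_stats["Number"] * 3, alpha=0.5)

# Write each borough name next to its bubble.
for pos, income, name in zip(positions, df_stats["Income"], df_stats["Borough"]):
    ax.annotate(name, (pos, income), fontsize=8, ha="center")

ax.set_xticks(positions)
ax.set_xticklabels(df_stats["PostCode"], rotation=90)
ax.set_xlabel("PostCode")
ax.set_ylabel("Mean income (£)")
fig.tight_layout()
```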

### Questions:

- Why do you think that the colours of the borough bubbles seem to be grouped together?
- Why do you think there appears to be an income cut-off for the number of takeaways?
- Which boroughs are outliers here? Why do you think they are outliers?


Next, you want to see whether plotting income against population, with bubble size representing the number of takeaways, will be helpful.

![London Boroughs](IncomevPostcodeWithBubbleSizeAsNumber2.png)

You can see that most of the boroughs lie in the upper-left quadrant. You can also see that larger bubbles seem to be located in boroughs with higher income and population. It is a little difficult to distinguish the boroughs, so you decide to look back at the data frame df_stats to clarify which boroughs have those characteristics.

### Questions:

- What can you infer from the previous two bubble plots? 
- What sort of relationships that occur in the plots do you think are important for placement of your business, and why? 
- Why might outliers be important to locating your business?

You start to realise that although the relationships between the variables in the data are useful, the outliers, or other data points that do not follow the general pattern, may also be useful in your decision. It may be good to locate your business in an area that is out of the general trend or is an outlier, as there may be more opportunities for takeaways in that area. As a business analyst, you know that both trends and outliers provide useful information, which taken together can provide impactful insights.

You decide to keep looking and noting both the relationships and data points that do not follow the trend.

The bubble chart below has the bubble size as Number: 

![London Boroughs](IncomevPostcodeWithBubbleSizeAsNumber3.png)

You note that there appears to be a relationship between income and density, but you may have to rescale the diagram to see that income tends to increase with population density. You could remove the outlier at the £100,000 income. At this point, you also note that bubble size seems larger with higher income and density, but you will have to explore this further. For now, you want to consider the number of takeaways per person statistic.

The bubble chart below has Number of takeaways as bubble size:

![London Boroughs](IncomevPostcodeWithBubbleSizeAsNumber4.png)

This bubble chart is useful, as it shows income and population cut-offs for the boroughs. It also shows that the density of takeaways per person is fairly uniform over the majority of boroughs. This is useful information to you. The outliers to the right are also interesting.

### Questions:

- How have these visualisations helped in making a decision about where to place your business?
- In particular, how have relationships and outliers helped in making your decision so far?
- If you had to make a decision now, which three boroughs would you choose to locate your business?

There is a lot of data to consider, so you now decide to start imposing more order on the data. What sort of charts might you want to look at next, and why?

You decide to order the number of takeaways by postcode. The image below shows a bar chart of the number of takeaways per postcode:

![London Boroughs](numberAndPostCodeBarChart.png)

The image below shows the same bar chart of the number of takeaways per postcode, ordered by number of takeaways:

![London Boroughs](numberAndPostCodeBarChartOrdered.png)

The image below shows a bar chart of Number per borough:

![London Boroughs](numberAndBoroughBarChart.png)

The image below shows a bar chart with Number per borough ordered by Number:

![London Boroughs](numberAndBoroughBarChartOrdered.png)
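
Ordering a bar chart in this way mostly comes down to sorting the data frame before plotting. The sketch below assumes a data frame named `df_stats` with the `Borough` and `Number` columns from the table and uses only five boroughs; it is an illustration rather than the notebook's own code.

```python
import pandas as pd

# Five boroughs copied from the table (the full data frame has 33 rows).
df_stats = pd.DataFrame({
    "Borough": ["Barnet", "Camden", "Croydon", "Kensington and Chelsea", "Westminster"],
    "Number": [180, 225, 245, 80, 305],
})

# Sort so that the borough with the most takeaways comes first.
ordered = df_stats.sort_values("Number", ascending=False)

# pandas draws the bars in row order, so the chart follows the sort order.
ax = ordered.plot.bar(x="Borough", y="Number", legend=False)
ax.set_ylabel("Number of takeaways")
```

Sorting on a different column, such as `NumPerPerson`, would give the equivalent ordered chart for takeaways per person.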

Next, the following bar chart shows you the number of takeaways per person per borough: 

![London Boroughs](numberOfTakewaysPerPersonPerBorough.png)

Question: Has ordering the data in this way helped in making your decision about location? If so, how has it helped you?

Before making a final decision, you want to try to see the big picture. You decide to look at correlation maps. Correlation is similar to a trend, but it is important to distinguish between correlation and causation. For example, in the plot of population against number of takeaways, you saw a definite upward trend in number as population increased. Here, you might infer that the number of takeaways is 'caused' by having a larger population in the borough. That seems to make some sense. However, you have to be very careful, as population by itself is not 'causing' more takeaways. There are many factors that contribute to the number of restaurants in a borough. There may be tax breaks, lenient coding laws, different ethnicities and many other factors. Therefore, until you have carried out further analysis (a statistic such as the coefficient of determination tells you how much of the variation is explained, but not why), all that can be said of the relationship between population and number of takeaways is that there is a correlation between the two. In this case, it is a positive correlation, meaning that population and the number of takeaways tend to increase together; the causes of this are unclear at this point. For all you know, it could be that having more takeaways in a borough influences people to move to that borough!

Correlation coefficients range from +1 to −1, where +1 means a perfect positive correlation. A correlation coefficient of −1 means the variables are perfectly negatively correlated and, therefore, move in opposite directions. A correlation coefficient of zero means there is no linear correlation between the variables, and there is a continuous range of values in between.
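
To make the coefficient more concrete, the sketch below computes it by hand for seven boroughs taken from the table. It assumes the Pearson correlation coefficient, which is the default in pandas; the notebook does not state which coefficient it uses, and the helper name `pearson_r` is hypothetical.

```python
import numpy as np

# Number of takeaways and population for seven boroughs copied from the table.
number = np.array([155, 135, 180, 225, 245, 80, 305], dtype=float)
population = np.array([8289, 224407, 415041, 261082, 406399, 162446, 264875], dtype=float)


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())


r = pearson_r(number, population)
print(round(r, 3))

# Sanity checks at the two extremes of the scale.
print(pearson_r(number, 2 * number + 1))   # perfectly positive relationship
print(pearson_r(number, -number))          # perfectly negative relationship
```

Because only seven boroughs are used here, the printed value will not match the coefficient in the full 33-borough table.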

The table below shows the correlation coefficients for all pairs of variables.
<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Number</th>      <th>Density</th>      <th>Income</th>      <th>Population</th>      <th>NumPerPerson</th>    </tr>  </thead>  <tbody>    <tr>      <th>Number</th>      <td>1.000000</td>      <td>0.316304</td>      <td>-0.133310</td>      <td>0.561998</td>      <td>0.714067</td>    </tr>    <tr>      <th>Density</th>      <td>0.316304</td>      <td>1.000000</td>      <td>0.255473</td>      <td>-0.078826</td>      <td>0.442963</td>    </tr>    <tr>      <th>Income</th>      <td>-0.133310</td>      <td>0.255473</td>      <td>1.000000</td>      <td>-0.502779</td>      <td>0.261303</td>    </tr>    <tr>      <th>Population</th>      <td>0.561998</td>      <td>-0.078826</td>      <td>-0.502779</td>      <td>1.000000</td>      <td>-0.169167</td>    </tr>    <tr>      <th>NumPerPerson</th>      <td>0.714067</td>      <td>0.442963</td>      <td>0.261303</td>      <td>-0.169167</td>      <td>1.000000</td>    </tr>  </tbody></table>
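
A correlation table with this layout is what `DataFrame.corr()` returns in pandas. The sketch below assumes a data frame called `df_stats` and uses only seven boroughs from the earlier table, so its numbers will differ from the full table above.

```python
import pandas as pd

# Seven boroughs copied from the earlier table (the full data frame has 33 rows).
df_stats = pd.DataFrame({
    "Borough": ["City of London", "Barking and Dagenham", "Barnet", "Camden",
                "Croydon", "Kensington and Chelsea", "Westminster"],
    "Number": [155, 135, 180, 225, 245, 80, 305],
    "Density": [2854.4, 6214.9, 4784.4, 11982.1, 4698.1, 13398.7, 12327.2],
    "Income": [99390, 34080, 54530, 67990, 45120, 116350, 80760],
    "Population": [8289, 224407, 415041, 261082, 406399, 162446, 264875],
})
df_stats["NumPerPerson"] = df_stats["Number"] / df_stats["Population"] * 100

# Select the numeric columns explicitly, then compute pairwise Pearson correlations.
numeric_cols = ["Number", "Density", "Income", "Population", "NumPerPerson"]
corr = df_stats[numeric_cols].corr()

print(corr.round(3))
```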


The figure below shows the correlation heat map of all the variables compared to each other. Note that the colour key on the right indicates the correlation value for each pair. The diagonal values are all 1, as they are comparisons of a variable with itself.

![London Boroughs](heatMap.png)
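
A heat map of this kind can be drawn directly from the correlation table. The sketch below copies the coefficients from the table and uses plain matplotlib; the colour map and the −1 to +1 colour range are assumptions, and the notebook's own figure may have been produced with a different library or scale.

```python
import matplotlib.pyplot as plt
import numpy as np

labels = ["Number", "Density", "Income", "Population", "NumPerPerson"]

# Correlation coefficients copied from the table.
corr = np.array([
    [1.000000, 0.316304, -0.133310, 0.561998, 0.714067],
    [0.316304, 1.000000, 0.255473, -0.078826, 0.442963],
    [-0.133310, 0.255473, 1.000000, -0.502779, 0.261303],
    [0.561998, -0.078826, -0.502779, 1.000000, -0.169167],
    [0.714067, 0.442963, 0.261303, -0.169167, 1.000000],
])

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the colour range to the full -1..+1 scale so that colours are comparable.
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="Correlation coefficient")

ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Print each coefficient inside its cell.
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f"{corr[i, j]:.2f}", ha="center", va="center", fontsize=8)

fig.tight_layout()
```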

Looking at the heat map and table, you can see some useful and interesting information. You notice that NumPerPerson (takeaways per person) has a moderate positive correlation with population density (about 0.44). You also notice that NumPerPerson has a weaker positive correlation with income (about 0.26) and a weak negative correlation with population (about −0.17).

Question: What might these noted correlations mean? Do these correlations, or any others, help you in your location decision?

You want to look at the correlations in the context of the boroughs, so you do a classification pairplot. This is where each pairing of variables is shown in a separate plot and each data point is classified (in this case, by borough, shown by colour).

![London Boroughs](classificationPairplot.png)

Question: Looking at the pairplot above, what do you notice about the plots? What stands out to you?

You notice that there are some outliers. While these outliers are useful in some contexts, they are not helpful here, as they make the scale too tight for the other data points. Therefore, you decide to take out the obvious outliers. Looking at the legend and the table, you find out which boroughs these are and remove them.

The plot below shows the classification pairplot with the outliers removed.

![London Boroughs](classificationPairplotNoOutliers.png)

Looking at the pairplot with the outliers removed, you start to get a good feel for the data and its relationships overall. You can now see how the boroughs are clustering into postcodes as well. There is some interesting information here.

Question: What possible correlations do you see in this pairplot? What conclusions can you draw about location from this pairplot?

Next, you decide to run the correlations again without the outliers. You notice some interesting changes now, which may make more sense. What in particular do you notice that is different?

### Correlation coefficients:

<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>Number</th>      <th>Density</th>      <th>Income</th>      <th>Population</th>      <th>NumPerPerson</th>    </tr>  </thead>  <tbody>    <tr>      <th>Number</th>      <td>1.000000</td>      <td>0.303817</td>      <td>-0.141992</td>      <td>0.585549</td>      <td>0.721069</td>    </tr>    <tr>      <th>Density</th>      <td>0.303817</td>      <td>1.000000</td>      <td>0.250292</td>      <td>-0.047124</td>      <td>0.402011</td>    </tr>    <tr>      <th>Income</th>      <td>-0.141992</td>      <td>0.250292</td>      <td>1.000000</td>      <td>-0.537797</td>      <td>0.280599</td>    </tr>    <tr>      <th>Population</th>      <td>0.585549</td>      <td>-0.047124</td>      <td>-0.537797</td>      <td>1.000000</td>      <td>-0.130861</td>    </tr>    <tr>      <th>NumPerPerson</th>      <td>0.721069</td>      <td>0.402011</td>      <td>0.280599</td>      <td>-0.130861</td>      <td>1.000000</td>    </tr>  </tbody></table>

### Heat Map:

![London Boroughs](HeatMap2.png)


You notice that the correlations have changed now that the obvious outliers have been removed, although a comparison of the two tables shows that most of the shifts are small.

Question: What correlations have changed? Has this had an impact on where you would place your takeaway restaurant?

Although removing those outliers has helped you in making a better decision, it does not mean that you should always remove outliers. There are certain numerical rules for what defines an outlier, but in addition, you know that outliers contain useful information and should only be removed when you are confident that doing so will reveal clearer information. This is where domain knowledge comes into play.
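
As an illustration of one such numerical rule, the sketch below applies the common 1.5 × IQR convention to the Income column, using the values from the borough table. This is an assumed example of a rule, not necessarily the one used to choose the outliers removed earlier, which were picked by eye from the pairplot.

```python
import pandas as pd

# Mean income per borough, copied from the borough table.
income = pd.Series({
    "City of London": 99390, "Barking and Dagenham": 34080, "Barnet": 54530,
    "Bexley": 44430, "Brent": 39630, "Bromley": 55140, "Camden": 67990,
    "Croydon": 45120, "Ealing": 45690, "Enfield": 41250, "Greenwich": 44370,
    "Hackney": 42690, "Hammersmith and Fulham": 62910, "Haringey": 45860,
    "Harrow": 49060, "Havering": 44430, "Hillingdon": 44950, "Hounslow": 44490,
    "Islington": 54950, "Kensington and Chelsea": 116350,
    "Kingston upon Thames": 56920, "Lambeth": 48610, "Lewisham": 43360,
    "Merton": 57160, "Newham": 34260, "Redbridge": 45380,
    "Richmond upon Thames": 76610, "Southwark": 48000, "Sutton": 49170,
    "Tower Hamlets": 45720, "Waltham Forest": 39460, "Wandsworth": 66220,
    "Westminster": 80760,
}, name="Income")

# Interquartile range: the spread of the middle half of the data.
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 IQRs below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = income[(income < lower_fence) | (income > upper_fence)]

print(flagged.sort_values(ascending=False))
```

A rule like this only flags candidates. Whether a flagged borough should actually be dropped still depends on the domain knowledge described above.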

Now it is time to make your decision based on the data you have. Make note of the top three places where you would place your burrito takeaway shop and your reasoning for this decision. Consider what other information you would like to use in making a more accurate location decision. How might this additional information be helpful? (Hint: You could include such factors as crime rate, nearby parks, student colleges, ethnicity, square footage, cost of shop space and age groups.)