#  Capstone Project - The Battle of Neighborhoods (Week 2)

 <div class="alert alert-block alert-info" style="margin-top: 20px"> 
 ## Table of contents
1. [Introduction](#introduction)    
2. [Data](#data)
3. [Methodology](#methodology)
4. [Data Acquisition](#data_acquisition)
5. [Analysis And Visualization](#data_analysis)
6. [Modelling](#model)
7. [Results and Prediction](#result)
8. [Conclusion](#conclusion)
9. [Further Development](#further)




## Introduction <a name ="introduction"></a>

**LIDL** is one of the cheapest supermarket chains in Europe, but it is not represented in many European cities, especially in Scandinavian countries such as Denmark. Over the last five years, however, the **LIDL Group** has started to build several **LIDL supermarkets** in different regions of Denmark, especially in the north of the country, called **Nordjylland** in Danish. The purpose of this capstone project is therefore to investigate where the next **LIDL supermarket** is likely to be built in **Aalborg**, the main city of **Nordjylland**.
The target audience is Aalborg Municipality, and the stakeholders are the **LIDL Group**, its main competitors such as **Rema1000**, **Føtex**, **Fakta** and **ALDI**, and **Aalborg City**.


## Data <a name ="data"></a>

To solve this problem, I need a dataset that locates all existing **Lidl** supermarkets and the main competitors already present in the Nordjylland region. I do not have an existing dataset for this problem, so the **Foursquare API** will be used to build one. The dataset will contain:
* All main competitors, such as **Rema1000**, **Føtex**, **Fakta** and **ALDI**, and their locations.
* All existing **Lidl** supermarkets, their geographic locations and their distance from the main city.
* The Aalborg boroughs or neighbourhoods, their population and the number of persons per km².

## Methodology <a name ="methodology"></a>

Having explained the problem and described the type of dataset that will be used, the following sections cover these steps:
* Acquire the data, using Foursquare in our case.
* Present an overview of the data.
* Perform exploratory data analysis (EDA) to find patterns in the dataset and get an idea of which machine learning method is suitable for the problem.
* Build a model.
* Test and explain the results.

### Data Acquisition <a name ="data_acquisition"></a>

As mentioned in the `Data` section, I do not have an existing dataset for this problem. To collect the necessary data, the **Foursquare API** will be used.

To define an instance of the geocoder, we need to define a `user_agent`. We will name our agent `foursquare_agent` and define the `address` and the `search_query`, which are used together with valid `Foursquare` developer credentials. The user credentials and the Foursquare API request take the following forms:

In [1]:
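CLIENT_ID = 'your-client-ID'  # your Foursquare ID
CLIENT_SECRET = 'your-client-secret'  # your Foursquare Secret
VERSION = '20180604'  # API version date (YYYYMMDD)
LIMIT = 30  # maximum number of venues returned per request
# Uncomment to check the credentials (avoid printing real secrets in a shared notebook):
# print('Your credentials:')
# print('CLIENT_ID: ' + CLIENT_ID)
# print('CLIENT_SECRET: ' + CLIENT_SECRET)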


> `https://api.foursquare.com/v2/venues/`**search**`?client_id=`**CLIENT_ID**`&client_secret=`**CLIENT_SECRET**`&ll=`**LATITUDE**`,`**LONGITUDE**`&v=`**VERSION**`&query=`**QUERY**`&radius=`**RADIUS**`&limit=`**LIMIT**
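
As a minimal sketch of how the placeholders in the URL template above get filled in, the snippet below assembles the query string with the standard library. The helper name `build_search_url`, the placeholder credentials and the approximate Aalborg coordinates are assumptions made for illustration; they are not part of the original notebook.

```python
from urllib.parse import urlencode

# Placeholder values: replace with real Foursquare developer credentials.
CLIENT_ID = "your-client-ID"
CLIENT_SECRET = "your-client-secret"
VERSION = "20180604"
LIMIT = 30

BASE_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(latitude, longitude, query, radius):
    """Assemble the venue search URL from the parameters in the template."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "ll": f"{latitude},{longitude}",
        "v": VERSION,
        "query": query,
        "radius": radius,
        "limit": LIMIT,
    }
    # urlencode percent-escapes special characters, such as the "ø" in "Føtex".
    return f"{BASE_URL}?{urlencode(params)}"


# Approximate coordinates for Aalborg, used only for illustration.
example_url = build_search_url(57.05, 9.92, "Føtex", 50000)
print(example_url)
```

Building the query string this way avoids malformed URLs when a search term contains non-ASCII characters or spaces.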

To find the location of a specific point of interest, the following code has been used.

In [2]:
from geopy.geocoders import Nominatim

# Return the (latitude, longitude) of a specific point of interest (POI).
def get_address(address):
    geolocator = Nominatim(user_agent="foursquare_agent")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print(latitude, longitude)
    return latitude, longitude

Assuming that we live in **Aalborg City** and need the nearest **Rema1000** supermarkets, the following few lines of code request the `json` response that contains the locations of all nearby **Rema1000** supermarkets.

In [None]:
import requests

address = 'Aalborg'
latitude, longitude = get_address(address)  # coordinates of the point of interest
search_query = 'Rema1000'
radius = 50000  # search radius in metres
url = ('https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'
       .format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, search_query, radius, LIMIT))
results = requests.get(url).json()

To convert the `json` response to a `pandas DataFrame` and filter the venues, the following function can be used:

In [None]:
from pandas import json_normalize

def format_requests_result(results):
    # keep only the columns that hold the venue name and anything associated with its location
    # assign the relevant part of the JSON to venues
    venues = results['response']['venues']
    # transform venues into a dataframe
    dataframe = json_normalize(venues)
    filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
    # copy so that the column assignment below does not act on a view of the original
    dataframe_filtered = dataframe.loc[:, filtered_columns].copy()
    # function that extracts the category of the venue
    def get_category_type(row):
        try:
            categories_list = row['categories']
        except KeyError:
            categories_list = row['venue.categories']
        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']
    # replace the list of categories with the name of the first category in each row
    dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)
    # clean column names by keeping only the last term
    dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
    dataframe_filtered = dataframe_filtered.drop(columns = ['formattedAddress','country', 'id','state', 'cc','labeledLatLngs'])
    return dataframe_filtered
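
To make the shape of the response concrete, here is a small sketch that flattens a mock payload using plain dictionaries instead of pandas. The payload only mimics the nesting the function above relies on (`response`, `venues`, `categories`, `location`); the venues, addresses and coordinates are made up, and `flatten_venue` is a hypothetical helper that is not part of the notebook.

```python
# Mock payload that mimics the nested shape of a venue search response.
# All venues, addresses and coordinates are invented for illustration only.
mock_results = {
    "response": {
        "venues": [
            {
                "id": "abc123",
                "name": "Example Market",
                "categories": [{"name": "Supermarket"}],
                "location": {
                    "address": "Example Street 1",
                    "lat": 57.05,
                    "lng": 9.92,
                    "postalCode": "9000",
                    "city": "Aalborg",
                },
            },
            {
                "id": "def456",
                "name": "Uncategorised Shop",
                "categories": [],
                "location": {"lat": 57.04, "lng": 9.93},
            },
        ]
    }
}


def flatten_venue(venue):
    """Return one flat row: name, first category name (or None), location fields."""
    categories = venue.get("categories", [])
    row = {
        "name": venue["name"],
        "categories": categories[0]["name"] if categories else None,
    }
    # Location keys are copied without a "location." prefix.
    row.update(venue.get("location", {}))
    return row


rows = [flatten_venue(v) for v in mock_results["response"]["venues"]]
for row in rows:
    print(row)
```

The second mock venue has no postal code, which is the kind of incomplete row that a later `dropna` step would remove.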

After these steps, we obtain the final dataset used to investigate the problem. It contains **Lidl** and its competitors.

In [None]:
import pandas as pd

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([rema_lidl_føtex_df, fakta_df], ignore_index=True, sort=False)
df.drop('crossStreet', axis=1, inplace=True)  # remove this column because it is not relevant to the analysis
df.dropna(axis=0, inplace=True)
df.head()

### Data Analysis and Visualization <a name ="data_analysis"></a>

After data wrangling, I will now perform some exploratory data analysis (EDA) and visualization to better understand the dataset and to find out which machine learning algorithm is suitable for this kind of problem.

In [None]:
df_num_of_supermarket = df.groupby(['name'])['postalCode'].count().reset_index()
df_num_of_supermarket

<img src = "lidl_1.jpg">

Surprisingly, after a simple grouping and counting by `postalCode`, I found that there are more `REMA 1000` stores in Aalborg city than any other supermarket, while there are only **9 Lidl** supermarkets. To find out where they are located, one can run the following query:

In [None]:
lidl_filtered_df = df[df['name'] == 'Lidl']
lidl_filtered_df

<img src = "lidl_super_table.jpg">

The table above shows that there is only **one (1) Lidl** store in the main city of **Aalborg**, and that **5 of the 8** remaining stores are located very far from the Aalborg city center. There are several possible reasons for this choice, and we will list some of them in the following sections. One can also visualize all supermarkets in **Nordjylland**:

<img src = "lidl_super_fig1.jpg">

The number of **Lidl** stores in Aalborg is presented in the following figure:
<img src ="lidl_super_table2.jpg">

From this graph one can see that **LIDL** has decided to build only one **supermarket** per neighbourhood (by postal code). The reason for this choice is difficult to understand, especially in the main **city**, where we find only **one (1) LIDL** out of 17.

The number of **supermarkets** in Aalborg city:

<img src ="aalb_sup.jpg">

<img src ="aau_city_sup.jpg">

Now one can create a map of Aalborg showing the supermarkets, their locations and the postal codes of their neighbourhoods using **folium**:

<img src ="aau_supermarkets_map.jpg">

Now we isolate **Lidl** to see the different store locations on the map and their distance from the **Aalborg city center**:

<img src ="lidl_aau.jpg">
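
The distance from the city center can also be recomputed directly from coordinates. The sketch below uses the haversine formula as one possible approach; it is not necessarily how the distance values in the dataset were produced. The city-center coordinates are approximate and the store position is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lng2 - lng1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate city-center coordinates, assumed for illustration.
AALBORG_CENTER = (57.05, 9.92)

# Hypothetical store position, not taken from the dataset.
store = (57.10, 10.00)
print(round(haversine_km(*AALBORG_CENTER, *store), 1))
```

Applied row by row to the `lat` and `lng` columns, a function like this would give each store's straight-line distance to the center in kilometres.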

### Modelling  <a name ="model"></a>

With a better understanding of the dataset, unsupervised machine learning appears suitable for this problem. In the next step, I will therefore apply the **k-means** algorithm to the location data and the distance. For this purpose I will filter the dataset and use the following methods:

In [None]:
from sklearn.cluster import KMeans
kclusters = 4  # number of clusters, chosen to see how the 4 supermarket chains are distributed in Aalborg

clust_data = df.drop(['name','city', 'categories', 'address', 'postalCode'], axis = 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(clust_data)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
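
To illustrate what the clustering step does, here is a simplified pure-Python sketch of Lloyd's algorithm, the basic iteration behind k-means. It is not scikit-learn's implementation, which by default uses a smarter initialisation (k-means++). The points are made-up coordinate pairs and the starting centroids are picked by hand.

```python
from math import dist


def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute means."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Made-up (lat, lng) pairs forming two obvious groups; not real store locations.
points = [(57.04, 9.91), (57.05, 9.92), (57.06, 9.93),
          (57.44, 10.52), (57.45, 10.53), (57.46, 10.54)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
```

Because k-means relies on Euclidean distance, features on very different scales (for example degrees of latitude next to a distance in metres) influence the clusters unequally, so scaling the features first may be worth considering.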


Let's add the cluster labels to the dataset as a new column.

In [None]:
df['cluster_labels'] = kmeans.labels_
df.head()

Using folium, one can present the different clusters:

<img src ="cluster_1.jpg">

### Examine Clusters

Now we can examine the clusters as follows:

In [None]:
df.loc[df['cluster_labels'] == 0] # the first cluster and the supermarkets assigned to it

In [None]:
df.loc[df['cluster_labels'] == 1] # the second cluster and the supermarkets assigned to it

In [None]:
df.loc[df['cluster_labels'] == 2] # the third cluster and the supermarkets assigned to it

In [None]:
df.loc[df['cluster_labels'] == 3] # the fourth cluster and the supermarkets assigned to it

## Results and Prediction  <a name ="result"></a>
The cluster analysis shows that many of the **supermarkets** are located in the main city of **Aalborg**, but this is not the case for **Lidl**, which is our target for this project. Consider the first **cluster**, which represents the group of supermarkets gathered in the main city with postal codes between **9000 and 9430**. There are **4 Lidl** stores assigned to this cluster and only one of them is located in the city center, with **postal code 9000**. I have also found that the **Lidl supermarkets** are located very far from each other.
Based on the results of the data analysis combined with the cluster analysis, **I would advise the Lidl Group** to build the next **Lidl** supermarket in the main city area (postal code 9000), near the competitors that are already concentrated there. The competition will of course be very hard, but the quality and price difference will play in favor of **Lidl**. Another factor in favor of **Lidl** is that it is more international than its competitors, which are more nationally based.
As the population of the main city is growing, the **Lidl** group can also think about how to increase its number of customers there.

## Conclusion and Discussion <a name ="conclusion"></a>
The main purpose of this project was to investigate where the **Lidl Group** will build the next **Lidl** supermarket, in order to help stakeholders find the optimal location for it in **Nordjylland**. Based on the EDA and on clustering by store location and distance to the city center, I came to the conclusion that the next **Lidl** should be placed closer to the city center. However, the final decision will be made by the different stakeholders based on specific characteristics that are not investigated in this project.


## Further Development <a name = "further"></a>
The following are suggestions for how this project could be developed further:
1. Integrate the municipal plan with the population growth.
2. Use more automated AI tools for better performance of the algorithm.