# **Interactive Data Processing and Visualization using NumPy and Pandas**
This project aims to use NumPy for data cleaning, reading, and processing, focusing on the latitude and longitude columns included in our dataset.



## Downloading the Dataset

Install the prerequisite packages needed to run the code below.

In [1]:
%%capture
!pip install numpy pandas streamlit gdown currencyconverter

In [2]:
import numpy as np

# For readability purposes, we will disable scientific notation for numbers
np.set_printoptions(suppress=True)

In [3]:
import os
import shutil

import gdown
from numpy import genfromtxt

# Google Drive file ID and local file name
# This file is based on data from: http://insideairbnb.com/get-the-data/
file_id_1 = "13fyESiH1ZEnMV6eabAyhe20t4W6peEWK"
downloaded_file_1 = "WK1_Airbnb_Amsterdam_listings_proj.csv"

# Download the file from Google Drive
gdown.download(id=file_id_1, output=downloaded_file_1)

Downloading...
From: https://drive.google.com/uc?id=13fyESiH1ZEnMV6eabAyhe20t4W6peEWK
To: /content/WK1_Airbnb_Amsterdam_listings_proj.csv
100%|██████████| 246k/246k [00:00<00:00, 15.9MB/s]


'WK1_Airbnb_Amsterdam_listings_proj.csv'

## Preprocessing the Dataset


#### Task 1: Finding delimiter

In [4]:
from numpy import genfromtxt

my_data = genfromtxt("/content/WK1_Airbnb_Amsterdam_listings_proj.csv", delimiter="|",dtype="unicode")

In [5]:
my_data[:,:4]

array([['', '0', '1', '2'],
       ['id', '23726706', '35815036', '31553121'],
       ['price', '$88.00', '$105.00', '$152.00'],
       ['latitude', '52.34916', '52.42419', '52.43237'],
       ['longitude', '4.97879', '4.95689', '4.91821']], dtype='<U18')
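
The cell above passes `delimiter="|"` directly. One way to find a delimiter without inspecting the file by eye is to count candidate characters in a single line. The sketch below is only an illustration: `guess_delimiter` and the sample line (assembled from the values in the output above) are hypothetical and not part of the original notebook.

```python
def guess_delimiter(line: str, candidates: str = ",;|\t") -> str:
    """Return the candidate character that occurs most often in `line`."""
    counts = {char: line.count(char) for char in candidates}
    return max(counts, key=counts.get)


# Hypothetical sample line built from the values shown in the output above
sample_line = "id|23726706|35815036|31553121"
delimiter = guess_delimiter(sample_line)
print(repr(delimiter))
```

A simple count like this can be misled by values that contain a candidate character themselves, so it is worth checking the result against a few lines of the file.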

#### Task 2: Clean the data


In [6]:
# Remove the first column and row
matrix = my_data[1:,1:]
matrix[:,:4]


array([['23726706', '35815036', '31553121', '34745823'],
       ['$88.00', '$105.00', '$152.00', '$87.00'],
       ['52.34916', '52.42419', '52.43237', '52.2962'],
       ['4.97879', '4.95689', '4.91821', '5.01231']], dtype='<U18')

#### Task 3: Shifting the matrix


In [7]:
# Transpose the matrix so that each row holds one listing
matrix =  matrix.T
matrix[:5,:]

array([['23726706', '$88.00', '52.34916', '4.97879'],
       ['35815036', '$105.00', '52.42419', '4.95689'],
       ['31553121', '$152.00', '52.43237', '4.91821'],
       ['34745823', '$87.00', '52.2962', '5.01231'],
       ['44586947', '$160.00', '52.31475', '5.0303']], dtype='<U18')

#### Task 4: Removing special characters


In [8]:
# Remove the dollar sign
matrix = np.char.replace(matrix, "$", "")

# Remove the comma
matrix = np.char.replace(matrix,",","")

Now the dataset contains only numeric strings, so it can be converted to a numeric `dtype` and used in calculations.
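
As a small illustration of the element-wise replacement used above, the sketch below applies `np.char.replace` to a few made-up price strings (the values in `prices` are assumptions for demonstration, not rows from the dataset). It also shows that the result stays a string array until it is cast explicitly.

```python
import numpy as np

# Hypothetical price strings, including one with a thousands separator
prices = np.array(["$88.00", "$1,250.00", "$105.00"])

# np.char.replace works element-wise on string arrays
cleaned = np.char.replace(prices, "$", "")
cleaned = np.char.replace(cleaned, ",", "")

# The result is still a Unicode string array until it is cast
as_numbers = cleaned.astype(np.float64)

print(cleaned.dtype.kind)  # 'U' means Unicode strings
print(as_numbers)
```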

#### Task 5: Verifying the matrix



In [9]:
# Check if the dollar sign is in our dataset
matrix[np.char.find(matrix,"$") > -1]

array([], dtype='<U18')

In [10]:
matrix[np.char.find(matrix,",") > -1]

array([], dtype='<U18')

#### Task 6: Data type



Enable numerical operations (calculations) by changing the `dtype` from Unicode strings to 32-bit floats (`np.float32`).


In [11]:
# Change Unicode to float32
matrix = matrix.astype(np.float32)
matrix[:5,:]

array([[23726706.     ,       88.     ,       52.34916,        4.97879],
       [35815036.     ,      105.     ,       52.42419,        4.95689],
       [31553120.     ,      152.     ,       52.43237,        4.91821],
       [34745824.     ,       87.     ,       52.2962 ,        5.01231],
       [44586948.     ,      160.     ,       52.31475,        5.0303 ]],
      dtype=float32)
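
One side effect is visible in the output above: the listing ID `31553121` is shown as `31553120.` after the cast. A 32-bit float cannot represent every integer above 16,777,216 (2^24), so large IDs get rounded. The sketch below reproduces this for a single ID; keeping the IDs in a 64-bit float or in a separate integer array is one possible way around it, offered here as a suggestion rather than as part of the original workflow.

```python
import numpy as np

listing_id = "31553121"  # ID taken from the output of Task 3

as_float32 = np.array([listing_id]).astype(np.float32)
as_float64 = np.array([listing_id]).astype(np.float64)

print(int(as_float32[0]))  # 31553120: the last digit is lost
print(int(as_float64[0]))  # 31553121: the ID survives intact
```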

## Setting up the price

Convert the price column from USD to INR.

In [12]:
from currency_converter import CurrencyConverter

cc = CurrencyConverter()

matrix[:5,:]

array([[23726706.     ,       88.     ,       52.34916,        4.97879],
       [35815036.     ,      105.     ,       52.42419,        4.95689],
       [31553120.     ,      152.     ,       52.43237,        4.91821],
       [34745824.     ,       87.     ,       52.2962 ,        5.01231],
       [44586948.     ,      160.     ,       52.31475,        5.0303 ]],
      dtype=float32)

In [13]:
matrix[:,1]

array([ 88., 105., 152., ..., 180., 174.,  65.], dtype=float32)

#### Task 7: Pick any currency
We pick INR (Indian rupee) from the list of supported currencies.

In [14]:
cc.currencies

{'AUD',
 'BGN',
 'BRL',
 'CAD',
 'CHF',
 'CNY',
 'CYP',
 'CZK',
 'DKK',
 'EEK',
 'EUR',
 'GBP',
 'HKD',
 'HRK',
 'HUF',
 'IDR',
 'ILS',
 'INR',
 'ISK',
 'JPY',
 'KRW',
 'LTL',
 'LVL',
 'MTL',
 'MXN',
 'MYR',
 'NOK',
 'NZD',
 'PHP',
 'PLN',
 'ROL',
 'RON',
 'RUB',
 'SEK',
 'SGD',
 'SIT',
 'SKK',
 'THB',
 'TRL',
 'TRY',
 'USD',
 'ZAR'}

In [15]:
# Get the conversion rate from US dollars to Indian rupees
INR_rate = cc.convert(1,"USD","INR")

# Multiply the price column by the conversion rate
matrix[:, 1] = matrix[:,1] * INR_rate

#### Task 8: Inflation?


Recent inflation all around the world has caused many companies to raise their prices. Consequently, Airbnb listings have also raised their prices by a certain amount. We assume an inflation rate of 7% and apply it to the newly converted prices.

In [16]:
# Multiply the price column by the inflation factor (1.00 + inflation rate)
matrix[:,1] = matrix[:,1] * 1.07
matrix[:,1]

array([ 7884.301,  9407.404, 13618.337, ..., 16126.979, 15589.413,
        5823.631], dtype=float32)
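
The arithmetic of the last two tasks can be sketched on a few prices. The exchange rate below is an assumed illustrative value, not the rate that `CurrencyConverter` returned in this notebook, so the results will differ from the output above.

```python
import numpy as np

prices_usd = np.array([88.0, 105.0, 152.0])  # first prices from the dataset
usd_to_inr = 83.0  # assumed illustrative rate, NOT the rate used in the notebook
inflation = 0.07   # 7% price increase

# Convert to rupees, then apply the inflation factor;
# both steps are plain element-wise multiplications
prices_inr = prices_usd * usd_to_inr * (1.0 + inflation)
print(prices_inr)
```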

#### Task 9: Rounding the decimals


In [17]:
# Round the new currency column to 2 decimals
matrix[:,1] = np.round(matrix[:,1], 2)
matrix[:5,:]

array([[23726706.     ,     7884.3    ,       52.34916,        4.97879],
       [35815036.     ,     9407.4    ,       52.42419,        4.95689],
       [31553120.     ,    13618.34   ,       52.43237,        4.91821],
       [34745824.     ,     7794.71   ,       52.2962 ,        5.01231],
       [44586948.     ,    14335.09   ,       52.31475,        5.0303 ]],
      dtype=float32)
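
Note that `np.round` rounds to the nearest value (ties go to the nearest even value); it rounds neither strictly up nor strictly down. If a strict direction is wanted, `np.floor` and `np.ceil` can be combined with scaling. The sketch below uses made-up values to show the difference.

```python
import numpy as np

values = np.array([0.5, 1.5, 2.5])

nearest = np.round(values)       # ties go to the nearest even value
rounded_down = np.floor(values)  # always toward negative infinity
rounded_up = np.ceil(values)     # always toward positive infinity

# Rounding down or up to 2 decimals by scaling with 100
price = 1.239
down_2dp = np.floor(price * 100) / 100
up_2dp = np.ceil(price * 100) / 100

print(nearest, rounded_down, rounded_up)
print(down_2dp, up_2dp)
```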

#### Task 10: Choose your location

Look up a place you'd like to visit in Amsterdam's city center, along with its longitude and latitude. We want to save this for choosing an Airbnb listing to our liking.




In [18]:
# Favorite location. List your coordinates as floats.
# Example: latitude = 52.3600, longitude = 4.8852
latitude = 52.3708
longitude =  4.9030

## Listing All Listings

<center>
  <img src=https://images0.persgroep.net/rcs/vnd5KBhggcKV72YJjpLWH_-xljU/diocontent/131036963/_crop/34/170/1378/778/_fitwidth/763?appId=93a17a8fd81db0de025c8abd1cca1279&quality=0.8&desiredformat=webp width="500" align="center" />
</center>
<br/>

Imagine Airbnb Amsterdam decided to deviate from Airbnb Global and provide a feature on their website that showed the best listings for you based on the locations you were planning to visit. Wouldn't it make sense to choose a place to stay in a location closest to where you're likely to go most often?

You will limit your results to your favorite location in Amsterdam (as chosen above) and the surrounding available Airbnb listings using math and NumPy.



In [19]:
import math

def from_location_to_airbnb_listing_in_meters(lat1: float, lon1: float, lat2: float, lon2: float):
    # Source: https://community.esri.com/t5/coordinate-reference-systems-blog
    # /distance-on-a-sphere-the-haversine-formula/ba-p/902128

    R = 6371000  # Radius of Earth in meters
    phi_1 = math.radians(lat1)
    phi_2 = math.radians(lat2)

    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)

    a = (
        math.sin(delta_phi / 2.0) ** 2
        + math.cos(phi_1) * math.cos(phi_2) * math.sin(delta_lambda / 2.0) ** 2
    )

    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    meters = R * c  # Output distance in meters

    return round(meters, 0)
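
A haversine implementation is easy to sanity-check against cases with a known answer. The sketch below restates the formula under a hypothetical name, `haversine_m`, without the final rounding, and checks two such cases: one degree of latitude along a meridian should equal `R * pi / 180`, and a point should be zero meters from itself. It also computes the distance from the chosen location to the first listing in the dataset.

```python
import math


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters on a sphere of radius 6,371 km."""
    R = 6371000
    phi_1, phi_2 = math.radians(lat1), math.radians(lat2)
    delta_phi = math.radians(lat2 - lat1)
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2.0) ** 2
        + math.cos(phi_1) * math.cos(phi_2) * math.sin(delta_lambda / 2.0) ** 2
    )
    return R * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


# One degree of latitude along a meridian is R * pi / 180
one_degree = haversine_m(0.0, 0.0, 1.0, 0.0)
expected = 6371000 * math.pi / 180

# A point has zero distance to itself
same_point = haversine_m(52.3708, 4.9030, 52.3708, 4.9030)

# Chosen location to the first listing in the dataset
first_listing = haversine_m(52.3708, 4.9030, 52.34916, 4.97879)

print(one_degree, expected, same_point, first_listing)
```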

#### Task 11: Loop or vectorize!


In [20]:
# Create a loop or vectorized way to calculate the distance,
conv_to_meters = np.vectorize(from_location_to_airbnb_listing_in_meters)

# going over all latitude and longitude entries in the dataset
conv_to_meters(latitude, longitude, matrix[:, 2], matrix[:, 3])


array([5681., 6972., 6924., ..., 6165., 6238., 5083.])

In [21]:
%%timeit -r 4 -n 100
# Use the `timeit` cell magic to measure how quickly the code runs

# Allow a Python function to be used in a (semi-)vectorized way
conv_to_meters = np.vectorize(from_location_to_airbnb_listing_in_meters)

# Apply the function, use timeit
conv_to_meters(latitude, longitude, matrix[:, 2], matrix[:, 3])

44.2 ms ± 8.83 ms per loop (mean ± std. dev. of 4 runs, 100 loops each)


Optimization is possible but is always a trade-off between the 'need for speed' and the 'need for delivery' of your results.
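
The NumPy documentation notes that `np.vectorize` is provided for convenience rather than performance and is essentially a `for` loop. The sketch below, using a hypothetical helper `scalar_sine`, shows that an explicit loop, `np.vectorize`, and a native NumPy function all give the same values; only where the loop runs differs.

```python
import math

import numpy as np


def scalar_sine(x: float) -> float:
    """Scalar-only function, like the math-based distance function."""
    return math.sin(x)


values = np.linspace(0.0, 1.0, 5)

looped = np.array([scalar_sine(v) for v in values])  # explicit Python loop
wrapped = np.vectorize(scalar_sine)(values)          # loop hidden by np.vectorize
native = np.sin(values)                              # loop runs in compiled code

print(np.allclose(looped, wrapped), np.allclose(wrapped, native))
```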

#### Task 12: Through NumPy


In [22]:
def from_location_to_airbnb_listing_in_meters(
    lat1: float, lon1: float, lat2: np.ndarray, lon2: np.ndarray
):
    R = 6371000  # radius of Earth in meters
    phi_1 = np.radians(lat1)
    phi_2 = np.radians(lat2)

    delta_phi = np.radians(lat2 - lat1)
    delta_lambda = np.radians(lon2 - lon1)

    a = (
        np.sin(delta_phi / 2.0) ** 2
        + np.cos(phi_1) * np.cos(phi_2) * np.sin(delta_lambda / 2.0) ** 2
    )

    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    meters = R * c  # output distance in meters

    return np.round(meters, 0)

#### Task 13: How much faster is it now?


In [23]:
%%timeit -r 4 -n 100

from_location_to_airbnb_listing_in_meters(
    latitude, longitude, matrix[:, 2], matrix[:, 3]
)

848 µs ± 197 µs per loop (mean ± std. dev. of 4 runs, 100 loops each)


Switching from the standard-library `math` functions wrapped in `np.vectorize` to their native NumPy equivalents cut the run time from about 44.2 ms to about 848 µs per loop, roughly a 50-fold speed-up.

---

## Prep the Dataset for Download!


Now that we've created a function to calculate the distance in meters for every Airbnb listing, we'll perform this calculation on the entire dataset and add the outputs to the matrix as a new column.

In addition, we'll add another column that contains only ones and zeros to represent the "color" of an entry/row. This column can be used later if you want to turn this dataset into an app using [Streamlit](https://streamlit.io/).



In [24]:
# Run the previous method
meters = from_location_to_airbnb_listing_in_meters(
    latitude, longitude, matrix[:, 2], matrix[:, 3]
)

# Add an axis to make concatenation possible
meters = meters.reshape(-1, 1)

# Append the distance in meters to the matrix
matrix = np.concatenate((matrix, meters), axis=1)

In [25]:
# Append a color to the matrix
colors = np.zeros(meters.shape)
matrix = np.concatenate((matrix, colors), axis=1)

# Append our entry to the matrix
fav_entry = np.array([1, 0, latitude, longitude, 0, 1]).reshape(1, -1)  # Uses the favorite location chosen in Task 10
matrix = np.concatenate((fav_entry, matrix), axis=0)

# Entries: airbnb_id, price, latitude, longitude,
# meters from favorite point, color
matrix[:5, :]

array([[       1.        ,        0.        ,       52.3708    ,
               4.903     ,        0.        ,        1.        ],
       [23726706.        ,     7884.29980469,       52.34915924,
               4.97878981,     5681.        ,        0.        ],
       [35815036.        ,     9407.40039062,       52.42419052,
               4.95689011,     6972.        ,        0.        ],
       [31553120.        ,    13618.33984375,       52.43236923,
               4.91821003,     6924.        ,        0.        ],
       [34745824.        ,     7794.70996094,       52.2961998 ,
               5.01231003,    11134.        ,        0.        ]])
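
The reshape step matters because `np.concatenate` along `axis=1` needs every input to be 2-D with the same number of rows. The sketch below shows this on small made-up arrays, together with `np.column_stack`, which accepts a 1-D array directly and could be used as an alternative.

```python
import numpy as np

table = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # shape (3, 2)
new_column = np.array([10.0, 20.0, 30.0])               # shape (3,)

# Turn the 1-D array into a single column before concatenating
as_column = new_column.reshape(-1, 1)                    # shape (3, 1)
wider = np.concatenate((table, as_column), axis=1)       # shape (3, 3)

# np.column_stack treats 1-D inputs as columns
also_wider = np.column_stack((table, new_column))

print(wider.shape, np.array_equal(wider, also_wider))
```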

In [26]:
# Export the data to use in the primer for next week
np.savetxt("WK1_Airbnb_Amsterdam_listings_proj_solution.csv", matrix, delimiter=",")
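
By default, `np.savetxt` writes every value in scientific notation (`fmt='%.18e'`). If a more readable file is preferred, the `fmt` argument accepts one format per column. The sketch below writes two made-up rows to in-memory buffers instead of files and shows that the compact variant loads back to the same numbers; the chosen formats are an assumption about what is convenient, not a requirement of the later steps.

```python
import io

import numpy as np

rows = np.array([[23726706.0, 7884.3], [35815036.0, 9407.4]])

# Default format: every value is written as '%.18e'
default_buffer = io.StringIO()
np.savetxt(default_buffer, rows, delimiter=",")

# One format per column: integer ID, price with two decimals
compact_buffer = io.StringIO()
np.savetxt(compact_buffer, rows, delimiter=",", fmt=["%d", "%.2f"])

# Read the compact text back into an array
compact_buffer.seek(0)
restored = np.loadtxt(compact_buffer, delimiter=",")

print(default_buffer.getvalue())
print(compact_buffer.getvalue())
```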

In [27]:
from google.colab import files

# Download the file locally
files.download('WK1_Airbnb_Amsterdam_listings_proj_solution.csv')

In [28]:
%%writefile streamlit_app.py
import pandas as pd
import plotly.express as px
import streamlit as st

# Display title and text
st.title("Data and visualization")
st.markdown("Here we can see the dataframe created during this project.")

# Read dataframe
dataframe = pd.read_csv(
    "WK1_Airbnb_Amsterdam_listings_proj_solution.csv",
    names=[
        "Airbnb Listing ID",
        "Price",
        "Latitude",
        "Longitude",
        "Meters from chosen location",
        "Location",
    ],
)

# We have a limited budget, therefore we would like to exclude
# listings with a price above 25000 rupees per night
dataframe = dataframe[dataframe["Price"] <= 25000]

# Display as integer
dataframe["Airbnb Listing ID"] = dataframe["Airbnb Listing ID"].astype(int)
# Round the price to 2 decimals and prefix the rupee sign
dataframe["Price"] = "₹ " + dataframe["Price"].round(2).astype(str)
# Rename the number to a string
dataframe["Location"] = dataframe["Location"].replace(
    {1.0: "To visit", 0.0: "Airbnb listing"}
)

# Display dataframe and text
st.dataframe(dataframe)
st.markdown("Below is a map showing all the Airbnb listings with a red dot and the location we've chosen with a blue dot.")

# Create the plotly express figure
fig = px.scatter_mapbox(
    dataframe,
    lat="Latitude",
    lon="Longitude",
    color="Location",
    color_discrete_sequence=["blue", "red"],
    zoom=11,
    height=500,
    width=800,
    hover_name="Price",
    hover_data=["Meters from chosen location", "Location"],
    labels={"color": "Locations"},
)
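# Center the map on the chosen location (the first row of the dataframe)
fig.update_layout(
    mapbox_center=dict(
        lat=dataframe.iloc[0]["Latitude"], lon=dataframe.iloc[0]["Longitude"]
    )
)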
# "stamen-terrain" tiles now need a Stadia Maps API key; "open-street-map" needs no token
fig.update_layout(mapbox_style="open-street-map")

# Show the figure
st.plotly_chart(fig, use_container_width=True)

Writing streamlit_app.py
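
The filtering and relabeling steps in the app can be tried on a tiny dataframe before running Streamlit. The rows below are made up for illustration and only include the two columns involved.

```python
import pandas as pd

# Hypothetical rows in the same layout as two columns of the exported CSV
frame = pd.DataFrame(
    {
        "Price": [0.0, 7884.3, 31000.0],
        "Location": [1.0, 0.0, 0.0],
    }
)

# A boolean mask keeps only the rows within the budget
affordable = frame[frame["Price"] <= 25000].copy()

# Map the numeric flag to a readable label
affordable["Location"] = affordable["Location"].replace(
    {1.0: "To visit", 0.0: "Airbnb listing"}
)

print(affordable)
```

Outside Colab, the app itself is typically started from a terminal with `streamlit run streamlit_app.py`.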


In [29]:
from google.colab import files

# Download the file locally
files.download('streamlit_app.py')

In [30]:
%%writefile requirements.txt
pandas
streamlit
plotly

Writing requirements.txt


In [31]:
from google.colab import files

# Download the file locally
files.download('requirements.txt')