## Data Analysis with Python: Zero to Pandas - Course Project Guidelines
#### (remove this cell before submission)
Important links:
- Make submissions here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
- Ask questions here: https://jovian.ml/forum/t/course-project-on-exploratory-data-analysis-discuss-and-share-your-work/11684
- Find interesting datasets here: https://jovian.ml/forum/t/recommended-datasets-for-course-project/11711
This is the starter notebook for the course project for [Data Analysis with Python: Zero to Pandas](https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas). You will pick a real-world dataset of your choice and apply the concepts learned in this course to perform exploratory data analysis. Use this starter notebook as an outline for your project. Focus on documentation and presentation - this Jupyter notebook will also serve as a project report, so make sure to include detailed explanations wherever possible using Markdown cells.
### Evaluation Criteria
Your submission will be evaluated using the following criteria:
* Dataset must contain at least 3 columns and 150 rows of data
* You must ask and answer at least 4 questions about the dataset
* Your submission must include at least 4 visualizations (graphs)
* Your submission must include explanations using markdown cells, apart from the code.
* Your work must not be plagiarized, i.e. copy-pasted from somewhere else.
Follow this step-by-step guide to work on your project.
### Step 1: Select a real-world dataset 
- Find an interesting dataset on this page: https://www.kaggle.com/datasets?fileType=csv
- The data should be in CSV format, and should contain at least 3 columns and 150 rows
- Download the dataset using the [`opendatasets` Python library](https://github.com/JovianML/opendatasets#opendatasets)
Here's some sample code for downloading the [US Elections Dataset](https://www.kaggle.com/tunguz/us-elections-dataset):
```
import opendatasets as od
dataset_url = 'https://www.kaggle.com/tunguz/us-elections-dataset'
od.download(dataset_url)
```
You can find a list of recommended datasets here: https://jovian.ml/forum/t/recommended-datasets-for-course-project/11711
### Step 2: Perform data preparation & cleaning
- Load the dataset into a data frame using Pandas
- Explore the number of rows & columns, ranges of values etc.
- Handle missing, incorrect and invalid data
- Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets etc.)
### Step 3: Perform exploratory analysis & visualization
- Compute the mean, sum, range and other interesting statistics for numeric columns
- Explore distributions of numeric columns using histograms etc.
- Explore relationship between columns using scatter plots, bar charts etc.
- Make a note of interesting insights from the exploratory analysis
### Step 4: Ask & answer questions about the data
- Ask at least 4 interesting questions about your dataset
- Answer the questions either by computing the results using Numpy/Pandas or by plotting graphs using Matplotlib/Seaborn
- Create new columns, merge multiple datasets and perform grouping/aggregation wherever necessary
- Wherever you're using a library function from Pandas/Numpy/Matplotlib etc. explain briefly what it does
### Step 5: Summarize your inferences & write a conclusion
- Write a summary of what you've learned from the analysis
- Include interesting insights and graphs from previous sections
- Share ideas for future work on the same topic using other relevant datasets
- Share links to resources you found useful during your analysis
### Step 6: Make a submission & share your work
- Upload your notebook to your Jovian.ml profile using `jovian.commit`.
- **Make a submission here**: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
- Share your work on the forum: https://jovian.ml/forum/t/course-project-on-exploratory-data-analysis-discuss-and-share-your-work/11684
- Browse through projects shared by other participants and give feedback
### (Optional) Step 7: Write a blog post
- A blog post is a great way to present and showcase your work.  
- Sign up on [Medium.com](https://medium.com) to write a blog post for your project.
- Copy over the explanations from your Jupyter notebook into your blog post, and [embed code cells & outputs](https://medium.com/jovianml/share-and-embed-jupyter-notebooks-online-with-jovian-ml-df709a03064e)
- Check out the Jovian.ml Medium publication for inspiration: https://medium.com/jovianml
### Example Projects
Refer to these projects for inspiration:
* [Analyzing StackOverflow Developer Survey Results](https://jovian.ml/aakashns/python-eda-stackoverflow-survey)
* [Analyzing Covid-19 data using Pandas](https://jovian.ml/aakashns/python-pandas-data-analysis) 
* [Analyzing your browser history using Pandas & Seaborn](https://medium.com/free-code-camp/understanding-my-browsing-pattern-using-pandas-and-seaborn-162b97e33e51) by Kartik Godawat
* [WhatsApp Chat Data Analysis](https://jovian.ml/PrajwalPrashanth/whatsapp-chat-data-analysis) by Prajwal Prashanth
* [Understanding the Gender Divide in Data Science Roles](https://medium.com/datadriveninvestor/exploratory-data-analysis-eda-understanding-the-gender-divide-in-data-science-roles-9faa5da44f5b) by Aakanksha N S
* [2019 State of JavaScript Survey Results](https://2019.stateofjs.com/demographics/)
* [2020 Stack Overflow Developer Survey Results](https://insights.stackoverflow.com/survey/2020)
**NOTE**: Remove this cell containing the instructions before making your submission. You can do this using the "Edit > Delete Cells" menu option.

## Selecting a Real World Data Set.

The dataset I selected is [ny-rental-properties-pricing](https://www.kaggle.com/datasets/ivanchvez/ny-rental-properties-pricing) from Kaggle. I chose it because of my recent interest in real estate.

The dataset is in CSV (comma-separated values) format and will be downloaded using the `opendatasets` Python library.

In [1]:
# Install opendatasets.
!pip install opendatasets



In [60]:
!pip install openpyxl 


Collecting openpyxl
  Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2


In [2]:
# Import opendatasets as `od`.
import opendatasets as od

In [3]:
# Download the datasets from Kaggle.

# Steps:
# get each dataset's link address and download it using `od`.

dataset_url = 'https://www.kaggle.com/datasets/ivanchvez/ny-rental-properties-pricing'
od.download(dataset_url) 

#dataset_url2 = 'https://www.kaggle.com/datasets/borismarjanovic/price-volume-data-for-all-us-stocks-etfs'
#od.download(dataset_url2) 

dataset_url3 = 'https://www.kaggle.com/datasets/new-york-city/nyc-dof-condominium-comparable-rental-income'
od.download(dataset_url3)

dataset_url4 = 'https://www.kaggle.com/datasets/ebrahimelgazar/new-york-city-airbnb-market'
od.download(dataset_url4)


Skipping, found downloaded files in "./ny-rental-properties-pricing" (use force=True to force download)
Skipping, found downloaded files in "./nyc-dof-condominium-comparable-rental-income" (use force=True to force download)
Skipping, found downloaded files in "./new-york-city-airbnb-market" (use force=True to force download)


# Data Preparation and Cleaning

Importing the essential libraries for data manipulation.

- `urlretrieve`: for downloading files (such as CSV datasets) from a URL to a local file.
- `numpy`: Python library for numerical computing.
- `pandas`: Python library for working with tabular data, which it stores in DataFrames.


In [4]:
# Import `urlretrieve` for downloading files from a URL.
from urllib.request import urlretrieve
# Import the data manipulation libraries.
import pandas as pd 
import numpy as np 


### Loading Dataset into a Pandas Dataframe.

Because the data is in `CSV` format, it can be loaded into a dataframe with the `pd.read_csv` function.

`NY_rental_df` is the name of the dataframe that will hold the raw dataset.

In [5]:
#load CSV format dataset into a pandas dataframe.
NY_rental_df = pd.read_csv('NY Realstate Pricing.csv')


### Exploring the Number of Rows, Columns, Ranges of Values, etc.

**-Number of Rows and Columns.**

Displaying the dataframe by evaluating its name gives a first look at the data, including the number of rows and columns.

In [6]:
NY_rental_df

Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
0,0,2595,Midtown,40.75362,-73.98377,Entire home/apt,225,15,10,48,0.39,1
1,1,3831,Brooklyn,40.68514,-73.95976,Entire home/apt,89,188,1,295,4.67,1
2,2,5099,Manhattan,40.74767,-73.97500,Entire home/apt,200,362,3,78,0.60,19
3,3,5121,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,0,29,49,0.38,365
4,4,5178,Manhattan,40.76489,-73.98493,Private room,79,141,2,454,3.52,242
...,...,...,...,...,...,...,...,...,...,...,...,...
17609,28313,23691588,Brooklyn,40.69312,-73.94073,Shared room,32,9,31,5,0.26,1
17610,17415,14712466,Brooklyn,40.65446,-73.92613,Shared room,99,7,100,1,0.03,0
17611,27827,23184420,Lower East Side,40.71172,-73.98864,Shared room,41,14,180,2,0.12,365
17612,29127,24555212,Manhattan,40.71113,-73.98840,Shared room,38,0,180,1,0.27,365


The dataframe above shows the rows (records) and columns (features) in the dataset.

We have `17614` rows and `12` columns. Because pandas numbers rows starting from 0, the first row has index `0` and the last row has index `17613`.

Note: The index column is not included in the number of columns.
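
The row and column counts can also be read directly from the `.shape` attribute instead of from the printed table. Below is a minimal sketch on a small stand-in dataframe (`sample_df` is a hypothetical name; its values are copied from the first rows shown above), not on the full dataset.

```python
import pandas as pd

# Small stand-in for NY_rental_df (hypothetical name, three rows only).
sample_df = pd.DataFrame({
    "id": [2595, 3831, 5099],
    "neighbourhood": ["Midtown", "Brooklyn", "Manhattan"],
    "price": [225, 89, 200],
})

# .shape is a (rows, columns) tuple; the index is not counted as a column.
n_rows, n_cols = sample_df.shape
print(f"{n_rows} rows, {n_cols} columns")
```

Running the same two lines on `NY_rental_df` should report the counts read off the table above.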


Let's get an overview of the dataset using the `.info` method.

In [7]:
NY_rental_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17614 entries, 0 to 17613
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   F1                     17614 non-null  int64  
 1   id                     17614 non-null  int64  
 2   neighbourhood          17614 non-null  object 
 3   latitude               17614 non-null  float64
 4   longitude              17614 non-null  float64
 5   room_type              17614 non-null  object 
 6   price                  17614 non-null  int64  
 7   days_occupied_in_2019  17614 non-null  int64  
 8   minimum_nights         17614 non-null  int64  
 9   number_of_reviews      17614 non-null  int64  
 10  reviews_per_month      17614 non-null  float64
 11  availability_2020      17614 non-null  int64  
dtypes: float64(3), int64(7), object(2)
memory usage: 1.5+ MB


**-Storage and Data type.**

The dataset has `17614` entries and `12` columns, with `float64`, `int64` and `object` data types. It uses at least `1.5 MB` of memory; the `+` in the output means the object columns may take more than the reported figure.
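
The data types and memory use can also be inspected column by column. Below is a minimal sketch on a hypothetical stand-in dataframe (`sample_df`), using `.dtypes` and `.memory_usage(deep=True)`. The `deep=True` option also measures the strings stored in text columns, which the default estimate leaves out.

```python
import pandas as pd

# Hypothetical stand-in for NY_rental_df with one column of each kind.
sample_df = pd.DataFrame({
    "id": [2595, 3831, 5099],
    "neighbourhood": ["Midtown", "Brooklyn", "Manhattan"],
    "latitude": [40.75362, 40.68514, 40.74767],
})

# Data type of each column.
print(sample_df.dtypes)

# Memory used per column in bytes; deep=True includes the text values themselves.
memory_bytes = sample_df.memory_usage(deep=True)
print(memory_bytes.sum(), "bytes in total")
```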

**-Range of Values**

We might want to explore the range of `id` and the other numerical features in the New York rental dataset.

This can be done efficiently using the `.describe` method.

We will get the range of values for the following columns:

In [8]:
# Get the columns.
NY_rental_df.columns

Index(['F1', 'id', 'neighbourhood', 'latitude', 'longitude', 'room_type',
       'price', 'days_occupied_in_2019', 'minimum_nights', 'number_of_reviews',
       'reviews_per_month', 'availability_2020'],
      dtype='object')

In [9]:
NY_rental_df.describe() 


Unnamed: 0,F1,id,latitude,longitude,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
count,17614.0,17614.0,17614.0,17614.0,17614.0,17614.0,17614.0,17614.0,17614.0,17614.0
mean,18547.564664,15720320.0,40.726755,-73.947732,145.45549,179.517656,7.392926,56.128988,1.60706,154.154763
std,11000.717341,9644155.0,0.056981,0.050213,194.990677,130.202015,19.233869,65.97237,1.635528,138.079651
min,0.0,2595.0,40.50868,-74.23986,0.0,0.0,1.0,1.0,0.01,0.0
25%,8192.25,6718288.0,40.686042,-73.980938,70.0,35.0,2.0,9.0,0.34,8.0
50%,19496.5,16546990.0,40.72054,-73.95305,109.0,198.0,3.0,33.0,1.06,125.0
75%,28686.75,24077070.0,40.763127,-73.930682,170.0,301.0,5.0,79.0,2.46,309.0
max,35596.0,30565280.0,40.90804,-73.72179,9999.0,364.0,1125.0,675.0,19.25,365.0


This table shows the `count` (number of entries for each numerical column), `mean`, `standard deviation (std)`, `minimum value (min)`, `25th percentile (25%)`, `50th percentile (50%)`, `75th percentile (75%)` and `maximum value (max)`.
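
If only the ranges are of interest, the minimum and maximum of every numeric column can be collected in one small table with `.agg`. This is a sketch on a hypothetical stand-in dataframe (`sample_df`, with values taken from the first five rows shown earlier), offered as one possible shortcut rather than the approach used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical stand-in holding two numeric columns from the first five rows.
sample_df = pd.DataFrame({
    "price": [225, 89, 200, 60, 79],
    "minimum_nights": [10, 1, 3, 29, 2],
})

# One row for the minimum and one for the maximum of each column.
value_ranges = sample_df.agg(["min", "max"])
print(value_ranges)

# Spread (max - min) of each column.
spread = value_ranges.loc["max"] - value_ranges.loc["min"]
print(spread)
```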

Now to our non-numerical columns, `neighbourhood` and `room_type`.
We'll get the distinct values contained in each of these columns using the `.unique` method.

In [10]:
# Range of values for neighborhood.
Neighbourhoods = NY_rental_df["neighbourhood"].unique()
Neighbourhoods


array(['Midtown', 'Brooklyn', 'Manhattan', 'Bedford-Stuyvesant',
       'Lower East Side', 'Park Slope', 'Williamsburg', 'East Village',
       'Harlem', 'Hamilton Heights', 'Bushwick', 'Alphabet City',
       'Flatbush', 'Long Island City', 'Clinton Hill', 'Fort Greene',
       'Upper West Side', 'Greenpoint', 'Kips Bay', "Hell's Kitchen",
       'East Harlem', 'Queens', 'Meatpacking District',
       'Brooklyn Heights', 'Prospect Heights', 'Chelsea',
       'Carroll Gardens', 'West Village', 'Gowanus', 'Lefferts Garden',
       'Flatlands', 'Kew Garden Hills', 'Upper East Side', 'Sunnyside',
       'DUMBO', 'Staten Island', 'Highbridge', 'Ridgewood', 'Jamaica',
       'Middle Village', 'Cobble Hill', 'Roosevelt Island', 'Soho',
       'West Brighton', 'Eastchester', 'Crown Heights',
       'Morningside Heights', 'Chinatown', 'Red Hook',
       'Kingsbridge Heights', 'The Rockaways', 'Midtown East',
       'Forest Hills', 'The Bronx', 'Washington Heights', 'Astoria',
       'Baycheste

This is a NumPy array of the neighbourhoods, which include Midtown, Van Nest and many more.
To get the number of unique neighbourhoods, we can use the `.size` attribute of NumPy arrays.

In [11]:
# Number of unique neighborhoods.
No_Neighbhd = Neighbourhoods.size

# Total number of entries in neighborhood column.
Neighbhd_entry = NY_rental_df.neighbourhood.size

# Range of values.
print(f"The neighborhood column contains {No_Neighbhd} unique neighborhoods and {Neighbhd_entry} entries.")


The neighborhood column contains 186 unique neighborhoods and 17614 entries.


In [12]:
# Room types.
Room_types = NY_rental_df["room_type"].unique() 
Room_types


array(['Entire home/apt', 'Private room', 'Shared room', 'Hotel room'],
      dtype=object)

In [13]:
# No. of entries in the room_type column.
type_room_entries = NY_rental_df["room_type"].size 

# Unique room types 
type_rooms = list(Room_types) 

# Range of values
print(f"The room type contains `{Room_types.size}` unique rooms namely `{type_rooms}` and have `{type_room_entries}` entries.") 


The room type contains `4` unique rooms namely `['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']` and have `17614` entries.
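
For categorical columns like these, `.nunique` gives the number of distinct values directly and `.value_counts` shows how many entries each value has. Below is a minimal sketch on a hypothetical four-row stand-in (`sample_df`); the counts it prints are for the stand-in only, not for the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the room_type column.
sample_df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Shared room"],
})

# Number of distinct room types.
n_room_types = sample_df["room_type"].nunique()

# Number of entries for each room type, most frequent first.
room_type_counts = sample_df["room_type"].value_counts()

print(n_room_types)
print(room_type_counts)
```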


**EVALUATION:
This dataset does not seem to contain any missing or invalid values.**

All columns have 17614 non-null entries and the figures don't seem out of place.
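
A direct way to confirm that nothing is missing is to count the null values per column with `.isna().sum()`. The sketch below uses a hypothetical stand-in dataframe (`sample_df`) with one `NaN` inserted on purpose, so that the check has something to find. On a dataframe with no missing values, every count would be `0`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in; the NaN is inserted on purpose for the demonstration.
sample_df = pd.DataFrame({
    "price": [225, 89, np.nan],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
})

# Number of missing values in each column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# True if any cell in the dataframe is missing.
has_missing = bool(sample_df.isna().any().any())
print(has_missing)
```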

In [14]:
# Get range of values for id in New York.
NY_rental_df.sort_values("id").head(5) 


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
0,0,2595,Midtown,40.75362,-73.98377,Entire home/apt,225,15,10,48,0.39,1
1,1,3831,Brooklyn,40.68514,-73.95976,Entire home/apt,89,188,1,295,4.67,1
2,2,5099,Manhattan,40.74767,-73.975,Entire home/apt,200,362,3,78,0.6,19
3,3,5121,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,0,29,49,0.38,365
4,4,5178,Manhattan,40.76489,-73.98493,Private room,79,141,2,454,3.52,242


In [15]:
NY_rental_df.sort_values("id").tail(5) 


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
12677,35587,30561918,Brooklyn,40.64954,-74.00166,Private room,65,192,2,3,1.32,177
1593,35588,30561958,Lower East Side,40.72149,-73.98918,Entire home/apt,99,353,1,3,0.27,0
1266,35592,30562589,Brooklyn,40.64487,-73.94711,Entire home/apt,70,351,1,23,1.91,0
12994,35594,30565063,Bushwick,40.70279,-73.91347,Private room,55,333,2,2,0.18,0
1536,35596,30565284,Jamaica,40.65861,-73.73672,Entire home/apt,135,8,1,6,0.52,85


It seems that the `id` ranges in value from `2,595` to `30,565,284`.


In [16]:
# Get range of values for F1.
NY_rental_df.sort_values("F1").head(5)
                         

Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
0,0,2595,Midtown,40.75362,-73.98377,Entire home/apt,225,15,10,48,0.39,1
1,1,3831,Brooklyn,40.68514,-73.95976,Entire home/apt,89,188,1,295,4.67,1
2,2,5099,Manhattan,40.74767,-73.975,Entire home/apt,200,362,3,78,0.6,19
3,3,5121,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,0,29,49,0.38,365
4,4,5178,Manhattan,40.76489,-73.98493,Private room,79,141,2,454,3.52,242


In [17]:
NY_rental_df.sort_values("F1").tail(5) 


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
12677,35587,30561918,Brooklyn,40.64954,-74.00166,Private room,65,192,2,3,1.32,177
1593,35588,30561958,Lower East Side,40.72149,-73.98918,Entire home/apt,99,353,1,3,0.27,0
1266,35592,30562589,Brooklyn,40.64487,-73.94711,Entire home/apt,70,351,1,23,1.91,0
12994,35594,30565063,Bushwick,40.70279,-73.91347,Private room,55,333,2,2,0.18,0
1536,35596,30565284,Jamaica,40.65861,-73.73672,Entire home/apt,135,8,1,6,0.52,85


The `F1` values for New York rental properties lie in the range `0` to `35,596`.

In [18]:
# Range of values for latitude.

NY_rental_df.sort_values("latitude").head(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
940,1329,639199,Tottenville,40.50868,-74.23986,Entire home/apt,309,75,3,67,0.88,316
4176,21800,18997371,Tottenville,40.50873,-74.23914,Entire home/apt,85,213,2,53,1.86,136
6908,17315,14569577,Staten Island,40.51133,-74.23803,Entire home/apt,275,76,4,30,0.75,323
1005,20345,17554298,Annadale,40.53871,-74.16966,Entire home/apt,180,213,1,165,4.96,241
7520,18428,15797728,Staten Island,40.54106,-74.14666,Entire home/apt,70,47,5,91,2.44,152


In [19]:

NY_rental_df.sort_values("latitude").tail(5)

Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
4372,16212,13669391,Woodlawn,40.89984,-73.86902,Entire home/apt,85,38,2,58,1.41,324
14098,17941,15259634,The Bronx,40.90175,-73.89761,Private room,77,362,2,9,0.25,0
4111,29964,25476838,The Bronx,40.90329,-73.89991,Entire home/apt,105,26,2,32,1.81,95
4252,22295,19359445,Wakefield,40.90391,-73.85312,Entire home/apt,120,301,2,43,1.48,51
14004,3086,2008227,Riverdale,40.90804,-73.90005,Private room,69,33,2,154,2.14,340


The `latitude` of the rental properties in New York ranges from `40.50868` to `40.90804`.

All the latitudes have an integer part of `40`.

In [20]:
# Range of values for longitude.

NY_rental_df.sort_values("longitude").head(5)

Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
940,1329,639199,Tottenville,40.50868,-74.23986,Entire home/apt,309,75,3,67,0.88,316
4176,21800,18997371,Tottenville,40.50873,-74.23914,Entire home/apt,85,213,2,53,1.86,136
6908,17315,14569577,Staten Island,40.51133,-74.23803,Entire home/apt,275,76,4,30,0.75,323
5978,30624,26258351,Staten Island,40.5479,-74.21017,Entire home/apt,75,309,3,29,1.67,2
1471,27403,22730139,Greenridge,40.56033,-74.18259,Entire home/apt,75,275,1,20,1.12,365


In [21]:

NY_rental_df.sort_values("longitude").tail(5) 


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
11138,31492,27191858,Jamaica,40.65455,-73.72671,Private room,75,0,1,2,0.14,365
9898,32586,28207954,Queens,40.73258,-73.726,Hotel room,95,186,1,7,0.62,349
7502,17677,14987516,Queens,40.65378,-73.72582,Entire home/apt,85,217,5,34,0.88,168
11664,24886,21277719,Queens,40.73138,-73.72435,Private room,55,1,1,28,1.07,350
13406,23086,19932387,Jamaica,40.73179,-73.72179,Private room,60,0,2,30,1.04,358


The `longitude` of the rental properties in New York ranges from `-74.23986` to `-73.72179`.

In [22]:
NY_rental_df.columns

Index(['F1', 'id', 'neighbourhood', 'latitude', 'longitude', 'room_type',
       'price', 'days_occupied_in_2019', 'minimum_nights', 'number_of_reviews',
       'reviews_per_month', 'availability_2020'],
      dtype='object')

In [23]:
# Range of values for rental price.

NY_rental_df.sort_values("price").head(5) 


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
17583,24908,21291569,Brooklyn,40.69211,-73.9067,Shared room,0,8,30,2,0.09,364
17592,24932,21304320,Bushwick,40.69166,-73.90928,Shared room,0,95,30,6,0.25,365
6120,23936,20624541,Williamsburg,40.70838,-73.94645,Entire home/apt,0,298,3,5,0.2,136
10037,23955,20639914,Bedford-Stuyvesant,40.68258,-73.91284,Private room,0,125,1,125,4.66,240
10036,23953,20639628,Bedford-Stuyvesant,40.68173,-73.91342,Private room,0,117,1,119,4.45,239


In [24]:

NY_rental_df.sort_values("price").tail(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
15994,18050,15392689,Manhattan,40.80242,-73.94215,Private room,5000,14,5,1,0.1,338
1624,34635,30035166,Manhattan,40.82511,-73.94961,Entire home/apt,5000,20,1,3,0.29,165
1098,4077,2953058,Brooklyn,40.69137,-73.96723,Entire home/apt,8000,14,1,1,0.03,355
9217,27988,23377410,Manhattan,40.72197,-74.00633,Entire home/apt,8500,204,30,2,0.12,83
17226,11581,9528920,Manhattan,40.71355,-73.98507,Private room,9999,282,99,6,0.12,0


The `price` of rental properties in New York ranges from `$0` to `$9,999`.

The `$0` prices may belong to properties that were leased for free or on a shared basis, but the dataset does not say, so they could also be data-entry errors.
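
Before deciding how to treat the zero-priced listings, it helps to count them. Below is a minimal sketch on a hypothetical stand-in dataframe (`sample_df`, built from four rows shown in the outputs above) that selects them with a boolean mask. Whether to keep or drop such rows is a judgment call that this sketch does not settle.

```python
import pandas as pd

# Hypothetical stand-in: two zero-priced rows and two priced rows from the outputs above.
sample_df = pd.DataFrame({
    "id": [21291569, 20624541, 2595, 3831],
    "price": [0, 0, 225, 89],
})

# Rows where the listed price is zero.
zero_price = sample_df[sample_df["price"] == 0]
n_zero = len(zero_price)
print(f"{n_zero} listings have a price of 0")

# One option: keep only listings with a positive price for price analysis.
priced_df = sample_df[sample_df["price"] > 0]
print(priced_df)
```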


In [25]:
# Range of values for days_occupied_in_2019.

NY_rental_df.sort_values("days_occupied_in_2019").head(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
10103,30318,25947875,Brooklyn,40.63573,-74.00552,Private room,60,0,1,5,0.57,349
8785,2661,1654738,Hell's Kitchen,40.76364,-73.99465,Entire home/apt,80,0,30,12,0.17,365
461,602,243229,Bedford-Stuyvesant,40.68016,-73.94878,Entire home/apt,280,0,3,5,0.05,365
17235,35149,30377718,Brooklyn,40.68595,-73.93109,Private room,37,0,100,2,0.18,0
16772,2353,1309148,Brooklyn,40.67396,-73.96083,Private room,48,0,25,4,0.05,339


In [26]:

NY_rental_df.sort_values("days_occupied_in_2019").tail(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
5772,30836,26494242,Manhattan,40.72135,-73.98391,Entire home/apt,200,364,3,4,0.23,44
2055,31437,27131580,Upper West Side,40.78986,-73.96688,Entire home/apt,195,364,1,31,1.88,11
5771,17045,14253050,Manhattan,40.77371,-73.95167,Entire home/apt,150,364,3,40,0.99,0
5769,29811,25293994,Manhattan,40.76931,-73.95336,Entire home/apt,140,364,3,6,0.36,0
13004,30787,26441519,Bushwick,40.70553,-73.91938,Private room,53,364,2,61,3.63,0


Some rental apartments were not occupied at all in 2019 (`0` days), while others were occupied for as many as `364` days, almost the whole year.

In [27]:
# Range of values for minimum_nights.

NY_rental_df.sort_values("minimum_nights").head(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
11319,16985,14195524,Manhattan,40.79919,-73.96249,Private room,110,47,1,87,2.13,312
11181,32404,28041818,Jamaica,40.67296,-73.79658,Private room,80,302,1,109,8.36,55
11180,24624,21112848,Jamaica,40.6651,-73.76614,Private room,70,296,1,19,0.75,81
11179,19770,16814205,Jamaica,40.67949,-73.79841,Private room,53,293,1,464,13.33,78
11178,22798,19748517,Jamaica,40.68009,-73.78736,Private room,75,217,1,19,0.65,180


In [28]:

NY_rental_df.sort_values("minimum_nights").tail(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
9829,8096,6654984,Park Slope,40.67376,-73.98397,Entire home/apt,200,212,365,4,0.12,173
537,709,271694,Midtown,40.75282,-73.97315,Entire home/apt,125,0,365,19,0.2,365
9830,27436,22761054,Manhattan,40.75241,-73.98558,Entire home/apt,135,16,456,3,0.25,365
17262,10500,8668115,Crown Heights,40.67255,-73.94914,Private room,50,1,500,10,0.2,364
552,728,277370,East Village,40.73168,-73.98662,Entire home/apt,139,102,1125,53,0.58,1


The `minimum_nights` column ranges from `1` to `1,125`.

It is not yet clear what this means in practice, since a year has only 365 nights.

One possibility is that the value is the minimum stay required per booking rather than the number of nights actually occupied. Another is that the period covered is longer than one year and spills into 2020.
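
To see how many listings have a `minimum_nights` value longer than a year, the rows can be filtered with `.query`. This is a sketch on a hypothetical stand-in dataframe (`sample_df`, using four rows that appear in the outputs above); it only isolates the rows for inspection and does not explain what the column means.

```python
import pandas as pd

# Hypothetical stand-in: three rows with very large values and one typical row.
sample_df = pd.DataFrame({
    "id": [277370, 8668115, 22761054, 2595],
    "minimum_nights": [1125, 500, 456, 10],
})

# Listings whose minimum_nights exceeds the number of nights in a year.
over_a_year = sample_df.query("minimum_nights > 365")
print(over_a_year.sort_values("minimum_nights", ascending=False))
```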

In [29]:
NY_rental_df.columns

Index(['F1', 'id', 'neighbourhood', 'latitude', 'longitude', 'room_type',
       'price', 'days_occupied_in_2019', 'minimum_nights', 'number_of_reviews',
       'reviews_per_month', 'availability_2020'],
      dtype='object')

In [30]:
# Range of number_of_reviews.

NY_rental_df.sort_values("number_of_reviews").head(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
10833,9700,7915831,Flatbush,40.63982,-73.96634,Private room,1100,0,1,1,0.03,365
16264,33456,29081288,Brooklyn,40.65454,-73.91885,Private room,90,0,7,1,0.16,358
9308,10579,8740683,Midtown,40.7569,-73.96408,Entire home/apt,115,0,30,1,0.03,365
16261,34547,29978942,Bedford-Stuyvesant,40.69522,-73.94027,Private room,50,344,7,1,0.13,0
8006,26224,21997804,Midtown,40.76691,-73.98545,Entire home/apt,150,306,7,1,0.04,0


In [31]:

NY_rental_df.sort_values("number_of_reviews").tail(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
11830,9963,8168619,Queens,40.77006,-73.87683,Private room,46,208,1,594,11.46,164
11006,1893,903947,Harlem,40.82124,-73.93838,Private room,49,25,1,623,7.56,0
11934,12660,10101135,Queens,40.66939,-73.76975,Private room,47,327,1,630,13.13,65
11320,1894,903972,Manhattan,40.82085,-73.94025,Private room,49,48,1,638,7.65,0
11187,11043,9145202,Jamaica,40.6673,-73.76831,Private room,47,339,1,675,14.02,0


The number of reviews ranges from `1` to `675`.

In [32]:
# Range of reviews_per_month.

NY_rental_df.sort_values("reviews_per_month").head(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
15901,1472,714075,Clinton Hill,40.68712,-73.95876,Private room,75,0,5,1,0.01,0
967,1363,652691,Manhattan,40.76934,-73.98464,Entire home/apt,95,0,30,1,0.01,365
11002,1691,808476,Harlem,40.83378,-73.94966,Private room,85,20,1,1,0.01,0
111,127,32363,Kew Garden Hills,40.74028,-73.83168,Private room,140,242,2,1,0.01,63
775,1057,479285,Sunnyside,40.74679,-73.91853,Entire home/apt,80,52,3,1,0.01,330


In [33]:

NY_rental_df.sort_values("reviews_per_month").tail(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
10737,20891,18173787,East Elmhurst,40.7638,-73.87238,Private room,48,145,1,496,15.4,349
11779,19032,16276632,Queens,40.76335,-73.87007,Private room,48,83,1,577,15.84,358
11921,27421,22750161,Queens,40.66298,-73.77,Private room,50,302,1,388,16.89,58
11183,26572,22176831,Jamaica,40.66158,-73.7705,Private room,50,305,1,421,17.44,50
11186,25312,21550302,Jamaica,40.6611,-73.7683,Private room,80,317,1,489,19.25,54


`reviews_per_month` seems to be the average number of reviews a property receives per month.
It ranges from `0.01` to `19.25` reviews per month.

In [34]:
# Range of values for availability_2020.

NY_rental_df.sort_values("availability_2020").head(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
7006,17165,14386892,Windsor Terrace,40.65823,-73.98059,Entire home/apt,99,88,4,88,2.21,0
7055,4422,3352608,Brooklyn,40.58527,-73.93534,Entire home/apt,300,2,5,8,0.12,0
2466,4645,3582816,Brooklyn,40.68652,-73.95031,Entire home/apt,120,122,2,81,1.25,0
2467,7492,6161282,Brooklyn,40.71014,-73.96253,Entire home/apt,275,122,2,81,1.45,0
7052,8201,6729589,Boerum Hill,40.68542,-73.98916,Entire home/apt,115,347,5,19,0.76,0


In [35]:

NY_rental_df.sort_values("availability_2020").tail(5)


Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
10617,23240,20034180,Canarsie,40.64351,-73.91406,Private room,89,0,1,9,0.31,365
10618,23250,20044802,Canarsie,40.64138,-73.91488,Private room,75,0,1,14,0.5,365
10654,2355,1312228,Clinton Hill,40.68371,-73.96461,Private room,55,57,1,3,0.06,365
16556,1709,814327,Upper East Side,40.77171,-73.95042,Private room,81,3,10,67,0.9,365
8519,2666,1656254,Brooklyn,40.72531,-73.94222,Entire home/apt,149,16,30,10,0.14,365


The `availability_2020` column ranges from `0` to `365`: some apartments were not available at all in 2020, while others were available for `365` days.


### -Additional Steps

- Parsing dates
- Removing unnecessary columns
- Creating additional columns
- Merging multiple datasets

Our dataset does not contain dates.

The column `F1` seems to be unnecessary.

We might need to look for similar datasets that contains some other columns we would like to see.
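
As an illustration of creating an additional column, the share of 2019 that a property was occupied can be derived from `days_occupied_in_2019`. This is only a sketch on a hypothetical stand-in dataframe (`sample_df`). The column name `occupancy_rate_2019` is made up for this example, and dividing by 365 assumes the column counts days within the 2019 calendar year.

```python
import pandas as pd

# Hypothetical stand-in using three rows from the table shown earlier.
sample_df = pd.DataFrame({
    "id": [2595, 3831, 5121],
    "days_occupied_in_2019": [15, 188, 0],
})

# New column (hypothetical name): fraction of the year the property was occupied.
sample_df["occupancy_rate_2019"] = sample_df["days_occupied_in_2019"] / 365
print(sample_df)
```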

In [36]:
NY_rental_df

Unnamed: 0,F1,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,reviews_per_month,availability_2020
0,0,2595,Midtown,40.75362,-73.98377,Entire home/apt,225,15,10,48,0.39,1
1,1,3831,Brooklyn,40.68514,-73.95976,Entire home/apt,89,188,1,295,4.67,1
2,2,5099,Manhattan,40.74767,-73.97500,Entire home/apt,200,362,3,78,0.60,19
3,3,5121,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,0,29,49,0.38,365
4,4,5178,Manhattan,40.76489,-73.98493,Private room,79,141,2,454,3.52,242
...,...,...,...,...,...,...,...,...,...,...,...,...
17609,28313,23691588,Brooklyn,40.69312,-73.94073,Shared room,32,9,31,5,0.26,1
17610,17415,14712466,Brooklyn,40.65446,-73.92613,Shared room,99,7,100,1,0.03,0
17611,27827,23184420,Lower East Side,40.71172,-73.98864,Shared room,41,14,180,2,0.12,365
17612,29127,24555212,Manhattan,40.71113,-73.98840,Shared room,38,0,180,1,0.27,365


In [37]:
NY_rental_df.drop(columns = ["F1"], inplace = True)


Temporarily dropping `reviews_per_month` by creating a new dataframe, `NY_rental`, without it.

In [38]:
NY_rental = NY_rental_df.drop(columns = ["reviews_per_month"]) 


In [39]:
NY_rental

Unnamed: 0,id,neighbourhood,latitude,longitude,room_type,price,days_occupied_in_2019,minimum_nights,number_of_reviews,availability_2020
0,2595,Midtown,40.75362,-73.98377,Entire home/apt,225,15,10,48,1
1,3831,Brooklyn,40.68514,-73.95976,Entire home/apt,89,188,1,295,1
2,5099,Manhattan,40.74767,-73.97500,Entire home/apt,200,362,3,78,19
3,5121,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,0,29,49,365
4,5178,Manhattan,40.76489,-73.98493,Private room,79,141,2,454,242
...,...,...,...,...,...,...,...,...,...,...
17609,23691588,Brooklyn,40.69312,-73.94073,Shared room,32,9,31,5,1
17610,14712466,Brooklyn,40.65446,-73.92613,Shared room,99,7,100,1,0
17611,23184420,Lower East Side,40.71172,-73.98864,Shared room,41,14,180,2,365
17612,24555212,Manhattan,40.71113,-73.98840,Shared room,38,0,180,1,365


In [40]:
NY_rental.id.unique().size

17614

The number of unique ids (`17614`) equals the number of rows, so no two rental properties share an id.
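
The same check can be written more directly. The sketch below uses a small hypothetical frame named `listings` in place of the real `NY_rental`, with a few ids taken from the table above.

```python
import pandas as pd

# Hypothetical stand-in for NY_rental with a handful of ids.
listings = pd.DataFrame({"id": [2595, 3831, 5099, 5121, 5178]})

# True when no id value occurs more than once.
ids_are_unique = listings["id"].is_unique

# Number of rows whose id already appeared in an earlier row.
duplicate_count = int(listings["id"].duplicated().sum())

print(ids_are_unique, duplicate_count)
```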

In [41]:
NY_rental.columns.size


10

We now have `10` columns in the new dataframe.

In [42]:
NY_rental.to_csv("NY_rental.csv", index = None) 


Let's add a column that holds, for each rental property, the total number of entries in its neighbourhood.

In [43]:
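%time
# `%time` on a line by itself times only an empty statement, so the timings
# printed below do not reflect the function call; `%%time` on the first line
# of the cell would time the whole cell.
def no_of_neighbhd_rp(NY_rental):
    """Print the neighbourhood entry counts for rows 10-19 and the total list length."""
    rp_count_list = []
    # For each row, count how many rows share that row's neighbourhood.
    for neighbourhood in NY_rental["neighbourhood"]:
        rp_count = NY_rental[NY_rental["neighbourhood"] == neighbourhood].neighbourhood.size
        rp_count_list.append(rp_count)

    print(rp_count_list[10:20], len(rp_count_list))

no_of_neighbhd_rp(NY_rental)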
        

CPU times: user 28 µs, sys: 6 µs, total: 34 µs
Wall time: 62 µs
[3875, 3875, 3875, 150, 3875, 3875, 3875, 3875, 3229, 239] 17614
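
The loop above filters the whole dataframe once per row and only prints the counts. One possible alternative, sketched here on a small hypothetical frame, is `groupby(...).transform("size")`, which returns one count per row so the result can be stored directly as a new column. The frame `listings` and the column name `neighbourhood_count` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for NY_rental using the first rows shown earlier.
listings = pd.DataFrame({
    "id": [2595, 3831, 5099, 5121, 5178],
    "neighbourhood": [
        "Midtown", "Brooklyn", "Manhattan", "Bedford-Stuyvesant", "Manhattan",
    ],
})

# transform("size") broadcasts each group's row count back to every row of the group.
listings["neighbourhood_count"] = (
    listings.groupby("neighbourhood")["neighbourhood"].transform("size")
)
print(listings)
```

Because the grouping is computed once rather than once per row, this approach should scale much better than the nested filtering.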


In [48]:
NY_neighbourhood = NY_rental[["neighbourhood"]]
NY_neighbourhood


Unnamed: 0,neighbourhood
0,Midtown
1,Brooklyn
2,Manhattan
3,Bedford-Stuyvesant
4,Manhattan
...,...
17609,Brooklyn
17610,Brooklyn
17611,Lower East Side
17612,Manhattan


In [59]:
NY_airbnb_review = pd.read_csv("airbnb_last_review.csv")
NY_airbnb_price = pd.read_csv("airbnb_price.csv") 
NY_airbnb_room = pd.read_excel("airbnb_room_type.xlsx")



ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.
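
The `ImportError` above means pandas could not find `openpyxl`, the package it needs to read `.xlsx` files. Installing it (for example with `pip install openpyxl`) and re-running the cell should resolve the error. A minimal sketch of a check that can run before calling `pd.read_excel` follows; the helper name `excel_engine_available` is hypothetical.

```python
import importlib.util


def excel_engine_available(package: str = "openpyxl") -> bool:
    """Return True if the given package can be imported in this environment."""
    return importlib.util.find_spec(package) is not None


if excel_engine_available():
    print("openpyxl is installed; pd.read_excel can read .xlsx files.")
else:
    print("openpyxl is missing; install it with: pip install openpyxl")
```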

In [52]:
NY_airbnb_review

Unnamed: 0,listing_id,host_name,last_review
0,2595,Jennifer,May 21 2019
1,3831,LisaRoxanne,July 05 2019
2,5099,Chris,June 22 2019
3,5178,Shunichi,June 24 2019
4,5238,Ben,June 09 2019
...,...,...,...
25204,36425863,Rusaa,July 07 2019
25205,36427429,H Ai,July 07 2019
25206,36438336,Ben,July 07 2019
25207,36442252,Blaine,July 07 2019
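
The `last_review` values are stored as text such as `May 21 2019`. Since parsing dates is one of the additional steps listed earlier, here is a minimal sketch of converting them with `pd.to_datetime`, using a few rows copied from the output above. The frame name `reviews`, the new column `review_month`, and the format string `%B %d %Y` are assumptions; the format presumes every value follows the month-name, day, year pattern shown.

```python
import pandas as pd

# A few rows copied from the output above, used as a stand-in for NY_airbnb_review.
reviews = pd.DataFrame({
    "listing_id": [2595, 3831, 5099],
    "last_review": ["May 21 2019", "July 05 2019", "June 22 2019"],
})

# errors="coerce" turns values that do not match the format into NaT instead of raising.
reviews["last_review"] = pd.to_datetime(
    reviews["last_review"], format="%B %d %Y", errors="coerce"
)

# Once the column is a datetime, parts such as the month are available through .dt.
reviews["review_month"] = reviews["last_review"].dt.month
print(reviews.dtypes)
```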


In [53]:
NY_airbnb_price

Unnamed: 0,listing_id,price,nbhood_full
0,2595,225 dollars,"Manhattan, Midtown"
1,3831,89 dollars,"Brooklyn, Clinton Hill"
2,5099,200 dollars,"Manhattan, Murray Hill"
3,5178,79 dollars,"Manhattan, Hell's Kitchen"
4,5238,150 dollars,"Manhattan, Chinatown"
...,...,...,...
25204,36425863,129 dollars,"Manhattan, Upper East Side"
25205,36427429,45 dollars,"Queens, Flushing"
25206,36438336,235 dollars,"Staten Island, Great Kills"
25207,36442252,100 dollars,"Bronx, Mott Haven"
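
In this table `price` is text (for example `225 dollars`), so it cannot be summarised numerically, and `nbhood_full` packs the borough and the neighbourhood into one string. The sketch below shows one way these columns might be cleaned, using a few rows copied from the output above. The frame name `prices` and the new column names `borough` and `neighbourhood` are assumptions.

```python
import pandas as pd

# A few rows copied from the output above, used as a stand-in for NY_airbnb_price.
prices = pd.DataFrame({
    "listing_id": [2595, 3831, 5099],
    "price": ["225 dollars", "89 dollars", "200 dollars"],
    "nbhood_full": [
        "Manhattan, Midtown", "Brooklyn, Clinton Hill", "Manhattan, Murray Hill",
    ],
})

# Strip the " dollars" suffix and convert the remaining digits to integers.
prices["price"] = prices["price"].str.replace(" dollars", "", regex=False).astype(int)

# Split "Borough, Neighbourhood" on the first comma only.
prices[["borough", "neighbourhood"]] = prices["nbhood_full"].str.split(
    ", ", n=1, expand=True
)
print(prices)
```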


In [57]:
NY_airbnb_price.describe() 


Unnamed: 0,listing_id
count,25209.0
mean,20689220.0
std,11029280.0
min,2595.0
25%,12022730.0
50%,22343910.0
75%,30376690.0
max,36455810.0


In [58]:
NY_rental.describe()


Unnamed: 0,id,latitude,longitude,price,days_occupied_in_2019,minimum_nights,number_of_reviews,availability_2020
count,17614.0,17614.0,17614.0,17614.0,17614.0,17614.0,17614.0,17614.0
mean,15720320.0,40.726755,-73.947732,145.45549,179.517656,7.392926,56.128988,154.154763
std,9644155.0,0.056981,0.050213,194.990677,130.202015,19.233869,65.97237,138.079651
min,2595.0,40.50868,-74.23986,0.0,0.0,1.0,1.0,0.0
25%,6718288.0,40.686042,-73.980938,70.0,35.0,2.0,9.0,8.0
50%,16546990.0,40.72054,-73.95305,109.0,198.0,3.0,33.0,125.0
75%,24077070.0,40.763127,-73.930682,170.0,301.0,5.0,79.0,309.0
max,30565280.0,40.90804,-73.72179,9999.0,364.0,1125.0,675.0,365.0
