<a href="https://colab.research.google.com/github/mrcuny/python_assignment/blob/main/07_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 7**

# **Weeks 8 & 9 - Pandas**
* In this homework assignment, you will explore and analyze a public dataset of your choosing. Since this assignment is “open-ended” in nature, you are free to expand upon the requirements below. However, you must meet the minimum requirements indicated in each section.

* You must use Pandas as the **primary tool** to process your data.

* The preferred format for this analysis is an .ipynb file. Feel free to use whichever platform you prefer.
 * https://www.youtube.com/watch?v=inN8seMm7UI (Getting started with Colab).

* Your data should need some "work", or be considered "dirty".  You must show your skills in data cleaning/wrangling.

### **Some data examples:**
•	https://www.data.gov/

•	https://opendata.cityofnewyork.us/

•	https://datasetsearch.research.google.com/

•	https://archive.ics.uci.edu/ml/index.php

### **Resources:**

•	https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html 

•	https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html


### **Headings or comments**
**You are required to use comments or headings for each section. You must explain what your code is doing and the results of running your code.**  Act as if you were giving this assignment to your manager - you must include clear and descriptive information for each section.

### **You may work as a group or individually on this assignment.**


# Introduction

In this section, please describe the dataset you are using.  Include a link to the source of this data.  You should also provide some explanation of why you chose this dataset.

For this assignment, I have chosen the "New York City Airbnb Open Data" dataset from Kaggle (https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data). This dataset contains information about Airbnb listings in New York City, including information on the host, location, availability, price, etc. I chose this dataset because it is interesting to explore the Airbnb market in a major city like New York and there is a lot of potential for data analysis and visualization.

______________
# Data Exploration
Import your dataset into your .ipynb, create dataframes, and explore your data.  

Include: 

* Summary statistics (means, medians, quartiles)
* Missing value information
* Any other relevant information about the dataset.  



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("AB_NYC_2019.csv")

#view the data
print("View the data: \n")
print(df.head())

# Check the summary statistics for the numeric columns
print("\nCheck the summary statistics for the numeric columns: \n")
print(df.describe())

#Check for missing values
print("\nCheck for missing values: \n")
print(df.isnull().sum())

# Check the structure of the data
# (df.info() prints its report itself; wrapping it in print() would also print "None")
print("\nCheck the structure of the data: \n")
df.info()
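
The counts from `isnull().sum()` are easier to compare across columns as percentages, and quartiles can be requested explicitly with `quantile()`. The following is a minimal sketch on a small made-up frame; the values are illustrative only and do not come from `AB_NYC_2019.csv`, and the variable names `toy`, `missing_pct`, and `price_quartiles` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real listings (values are made up)
toy = pd.DataFrame({
    "price": [100, 150, 200, 250],
    "reviews_per_month": [0.5, np.nan, 1.5, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# First quartile, median, and third quartile of the price column
price_quartiles = toy["price"].quantile([0.25, 0.5, 0.75])
print(price_quartiles)
```

The same two expressions should work on `df` once the CSV has been loaded.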


# Data Wrangling
Create a subset of your original data and perform the following.  

1. Modify multiple column names.

2. Look at the structure of your data – are any variables improperly coded? Such as strings or characters? Convert to correct structure if needed.

3. Fix missing and invalid values in data.

4. Create new columns based on existing columns or calculations.

5. Drop column(s) from your dataset.

6. Drop a row(s) from your dataset.

7. Sort your data based on multiple variables. 

8. Filter your data based on some condition. 

9. Convert all the string values to upper or lower cases in one column.

10. Check whether numeric values are present in a given column of your dataframe.

11. Group your dataset by one column, and get the mean, min, and max values by group. 
  * groupby()
  * agg() or .apply()

12. Group your dataset by two columns and then sort the aggregated results within the groups. 

**You are free (and encouraged) to add to these questions.  Please clearly indicate your answers to these questions in your assignment.**

In [None]:

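# Select a subset of columns; .copy() makes an independent DataFrame so later
# modifications do not trigger SettingWithCopyWarning
subset_df = df[['name', 'host_id', 'neighbourhood_group', 'price', 'number_of_reviews', 'last_review', 'reviews_per_month']].copy()

# Modify column names
subset_df = subset_df.rename(columns={'neighbourhood_group': 'borough', 'number_of_reviews': 'reviews'})

# Fix missing and invalid values in data: drop rows with missing values,
# then keep only prices greater than 0 and below 1000
subset_df = subset_df.dropna()
subset_df = subset_df[(subset_df['price'] > 0) & (subset_df['price'] < 1000)].copy()

# Create new columns based on existing columns or calculations
# (a review count of 0 is treated as missing to avoid division by zero)
subset_df['price_per_review'] = subset_df['price'] / subset_df['reviews'].replace(0, np.nan)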

# Drop column(s) from your dataset
subset_df.drop(['host_id'], axis=1, inplace=True)

# Drop a row(s) from your dataset
subset_df.drop(index=subset_df[subset_df['borough'] == 'Unknown'].index, inplace=True)

# Sort your data based on multiple variables
subset_df.sort_values(by=['borough', 'price'], ascending=[True, False], inplace=True)

# Filter your data based on some condition
subset_df = subset_df[subset_df['reviews'] > 50].copy()

# Convert all the string values to upper or lower cases in one column
subset_df['borough'] = subset_df['borough'].str.lower()

# Check whether numeric values are present in a given column of your dataframe
# (print() is needed because only the last expression of a cell is displayed)
print(subset_df['last_review'].str.isnumeric())

# Group your dataset by one column, and get the mean, min, and max values by group
print(subset_df.groupby('borough').agg({'price': ['mean', 'min', 'max']}))

# Group your dataset by two columns and then sort the aggregated results within the groups
print(subset_df.groupby(['borough', 'price']).agg({'reviews': 'mean'}).sort_values(by=['borough', 'reviews'], ascending=[True, False]))
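
Question 2 (checking for improperly coded variables) has no matching step in the cell above, and `str.isnumeric()` only recognizes strings made up entirely of digit characters, so it misses decimals and negative numbers. One possible approach, sketched here on made-up rows, is to parse dates with `pd.to_datetime` and to test for numeric content with `pd.to_numeric(..., errors="coerce")`. The frame `toy` and its values are hypothetical; only the column names follow this notebook.

```python
import pandas as pd

# Made-up rows; the column names follow the subset used in this notebook
toy = pd.DataFrame({
    "last_review": ["2019-05-21", "2018-10-19", "not a date"],
    "price": ["149", "89.5", "n/a"],
})

# Parse date strings; values that cannot be parsed become NaT
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

# Coerce to numbers; non-numeric entries become NaN,
# so notna() flags the rows that hold a numeric value
price_numeric = pd.to_numeric(toy["price"], errors="coerce")
is_numeric = price_numeric.notna()

print(toy.dtypes)
print(is_numeric.tolist())
```

Coercing instead of raising keeps the bad rows visible as `NaT` or `NaN`, so they can be counted or dropped in a later cleaning step.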

# Conclusions  

After exploring your dataset, provide a short summary of what you noticed from this dataset.  What would you explore further with more time?

From the dataset exploration and wrangling, we can see that the price of Airbnb listings in New York City varies by borough, with Manhattan having the highest average price per night. We can further explore the relationship between price and various other factors such as the type of room, availability, and neighborhood. Additionally, we can use modeling techniques to predict the price of Airbnb listings based on these factors.
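
As a starting point for that follow-up, here is a rough sketch of comparing prices across room types within each borough. The rows are made up for illustration, and `room_type` and `neighbourhood_group` are assumed to be available as columns in the full dataset (they are not part of the subset used above).

```python
import pandas as pd

# Made-up listings; not taken from AB_NYC_2019.csv
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Private room", "Private room"],
    "price": [200, 100, 150, 60, 80],
})

# Mean, median, and listing count per borough and room type,
# sorted so the most expensive room type comes first within each borough
summary = (
    toy.groupby(["neighbourhood_group", "room_type"])["price"]
    .agg(["mean", "median", "count"])
    .reset_index()
    .sort_values(["neighbourhood_group", "mean"], ascending=[True, False])
)
print(summary)
```

Including the count next to the mean helps show which group averages rest on only a few listings.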

In [8]:
from google.colab import files
uploaded = files.upload()


Saving AB_NYC_2019.csv to AB_NYC_2019.csv
