# Analysis of a Property Dataset

The assignment will focus on pre-processing data using the Pandas library, followed by the creation of plots using matplotlib and seaborn libraries.

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

### Importing Libraries

Run the following cell to import packages and set plotting styling. 

**The plotting styling should not be changed**; doing so may result in KATE incorrectly evaluating your plots.

*Note: `matplotlib` does a lot of work in the background to "guess" which figure to plot on. This can modify figures you created earlier in the notebook, which will cause your plots to be wrong on KATE. To ensure your plots are always created properly, call `plt.figure()` before each command that creates a new plot; this ensures you plot on a new figure every time.*
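The effect described in the note can be seen in a minimal sketch on made-up data. The names `s1`, `s2`, `ax1` and `ax2` are hypothetical and are not variables KATE checks; the `Agg` backend line is only an assumption so that the sketch also runs outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy Series, only for illustration
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([3, 2, 1])

# Start a fresh figure, then plot the first Series on it
plt.figure()
ax1 = s1.plot(kind='line')

# Start another fresh figure so the second plot does not land on the first one
plt.figure()
ax2 = s2.plot(kind='line')

# Each plot lives on its own figure and holds exactly one line
print(ax1.figure is ax2.figure)
print(len(ax1.lines), len(ax2.lines))
```

Dropping the second `plt.figure()` call would typically make the second Series draw onto the axes of the first plot, which is the situation the note warns about.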

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR') #prevents unnecessary matplotlib warnings about seaborn color palette

### About the Dataset

Please refer to the `data/data-dictionary.pdf` file, which describes the dataset and each of its fields (properties and their characteristics).

### Importing the Dataset

Use `pd.read_csv()` to load our dataset `data/assessments.csv` and assign it to the DataFrame `df`:

In [None]:
df = pd.read_csv('data/assessments.csv')

Running `df.head()` and `df.info()` will show us how the DataFrame is structured:

In [None]:
#df.head()

In [None]:
#df.info()

### Charting Residential Properties with Pandas

**Q1.** Refer to the `df` DataFrame. Create a new DataFrame called `res` containing only entries from `df` with a `CLASSDESC` of `'RESIDENTIAL'`.

- Use the `.copy()` method to ensure you have a distinct DataFrame in memory
- Call the new dataframe `res`

See below code syntax for some guidance:
```python
res['CLASSDESC']=='RESIDENTIAL'
```
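As a rough illustration of how a comparison like this filters rows, here is a minimal sketch on a made-up frame. The names `toy`, `mask` and `toy_res` and all the values are hypothetical; none of them come from `data/assessments.csv`.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'CLASSDESC': ['RESIDENTIAL', 'COMMERCIAL', 'RESIDENTIAL'],
    'BEDROOMS': [3, 0, 2],
})

# The comparison yields a boolean Series: True where the row matches
mask = toy['CLASSDESC'] == 'RESIDENTIAL'

# Indexing with the mask keeps only the True rows;
# .copy() makes the result independent of the original frame
toy_res = toy[mask].copy()
print(toy_res)
```

Because `toy_res` is a copy, later changes to it leave `toy` untouched.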

In [None]:
#add your code below
#res = df.copy()



**Q2.** Create a new DataFrame called `res_16` containing only properties from `res` with `BEDROOMS` greater than 0 and less than 7.

- Use the `.copy()` method so that you have a distinct DataFrame in memory
- Call the new dataframe `res_16`
- Use the `.notnull()` method to filter out the rows in `BEDROOMS` which are null
- Use the `.astype()` method to change the data type of the `BEDROOMS` column to `int` : `.astype(int)`
- Filter the new DataFrame to only contain rows where `BEDROOMS` is greater than `0` and less than `7` : `(res_16['BEDROOMS'] > 0) & (res_16['BEDROOMS'] < 7)`
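The steps above can be sketched on a small made-up column. The names `toy` and `toy_16` and the values are hypothetical. Nulls are dropped first because a column that still contains `NaN` cannot be converted with `.astype(int)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one null, one zero, and one value outside the range
toy = pd.DataFrame({'BEDROOMS': [2.0, np.nan, 0.0, 7.0, 4.0]})

# 1. Keep only rows where BEDROOMS is not null, as an independent copy
toy_16 = toy[toy['BEDROOMS'].notnull()].copy()

# 2. With the nulls gone, the column can be converted to integers
toy_16['BEDROOMS'] = toy_16['BEDROOMS'].astype(int)

# 3. Combine two conditions with & (each condition needs its own parentheses)
toy_16 = toy_16[(toy_16['BEDROOMS'] > 0) & (toy_16['BEDROOMS'] < 7)]
print(toy_16)
```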




In [None]:
#add your code below
#res_16 = res.copy()



**Q3.** Use `.groupby()` on the `res_16` DataFrame to create a Series with an index of `BEDROOMS` and values of the `.mean()` of `FULLBATHS` for each number of `BEDROOMS`. Assign this Series to a new variable called `bed_bath`:

See below code syntax for some guidance:
```python
bed_bath = DataFrame_Name.groupby(by=...)[column].mean() 
```

The snippet below shows what the resulting Series should look like:

```python
BEDROOMS
1    1.030303
2    1.173469
3    1.354132
4    2.236301
...
...
```
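To see how grouping followed by `.mean()` produces a Series of this shape, here is a minimal sketch on made-up numbers. The names `toy` and `toy_bed_bath` are hypothetical, and the values will not match the real dataset.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'BEDROOMS': [1, 1, 2, 2, 2],
    'FULLBATHS': [1, 1, 1, 2, 3],
})

# Group the rows by BEDROOMS, pick the FULLBATHS column,
# and average it within each group
toy_bed_bath = toy.groupby(by='BEDROOMS')['FULLBATHS'].mean()
print(toy_bed_bath)
```

The grouping column becomes the index of the result, which is why the expected output above is headed by `BEDROOMS`.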

In [None]:
#add your code below
#bed_bath = ...



**Q4.** Refer to the `bed_bath` variable from the question above. Note that `bed_bath` is a pandas `Series` object.

Use the `.plot()` method on `bed_bath` to create a line plot with `kind` parameter set to `line`.

- This should result in a line plot of `BEDROOMS` on the **x-axis** with the average number of `FULLBATHS` on the **y-axis**
- Save your plot into a new variable `bb_line`


See below code syntax for some guidance:
```python
plt.figure()
bb_line = Series_Name.plot(kind='line')
```

In [None]:
#add your code below
#We create a new figure to make sure other figures in the notebook don't get modified
plt.figure()
#bb_line = ...



**Q5.** Refer to the `res_16` DataFrame. 
- Using the `res_16['BEDROOMS']` Series calculate `.value_counts()` for each value in the series
- Then use `.sort_index()` to order it by the index
- Save the results to a new variable called `beds`


See below code syntax for some guidance:
```python
beds = Series_Name.value_counts().sort_index()
```

The snippet below shows what the resulting Series should look like:

```python
1     33
2    294
3    593
4    292
...
...
```
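The reason for chaining `.sort_index()` is that `.value_counts()` orders its result by count rather than by value. A minimal sketch on a made-up Series (the names `toy_col` and `toy_beds` are hypothetical):

```python
import pandas as pd

# Hypothetical toy column, only for illustration
toy_col = pd.Series([3, 1, 3, 2, 3, 2], name='BEDROOMS')

# value_counts() counts each distinct value, most frequent first;
# sort_index() then reorders the result by the values themselves
toy_beds = toy_col.value_counts().sort_index()
print(toy_beds)
```

Here the value 3 occurs most often, so it would come first without `.sort_index()`; after sorting, the index runs 1, 2, 3.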

In [None]:
#add your code below
#beds = ...



**Q6.** Refer to the `beds` variable from the question above. Note that `beds` is a pandas `Series` object.

- Use the `.plot()` method on `beds` to create a bar plot with `kind` parameter set to `bar` and `title` parameter set to `Residential housing by number of bedrooms`
- Save your bar plot into a variable called `beds_bar`


See below code syntax for some guidance:
```python
plt.figure()
beds_bar = Series_Name.plot(kind='bar', title=...)
```

In [None]:
#add your code below
#We create a new figure to make sure other figures in the notebook don't get modified
plt.figure()
#beds_bar = 



**Q7.** Create a function called `zip_land` which takes two arguments: a DataFrame (with the same columns as `df`) and an integer (which it can be assumed will always be present in the `PROPERTYZIP` column of the DataFrame).

This function will need to filter down the `df` argument to the rows where the `PROPERTYZIP` column is equal to the `zip_code` argument, before returning a `scatter` plot with the following properties:

   - `x`='LOTAREA'
   - `y`='FAIRMARKETLAND'
   - `xlim` and `ylim` both from `0` to double the `.mean()` of the respective column values
   - `alpha`=0.4
   - `figsize`=(12,10)


```python
def zip_land(df, zip_code):
    my_plot = [command to plot]
    return my_plot

```
Please note you have been provided with the code for this question to carry out the necessary analysis. Simply uncomment the lines of code and run the code cell to produce the desired results.

In [None]:
#add your code below

#def zip_land(df, zip_code):
#    sub = df[(df['PROPERTYZIP'] == zip_code)]
#    zip_chart = sub.plot(x='LOTAREA', 
#                         y='FAIRMARKETLAND', 
#                         kind='scatter', 
#                         xlim=(0, sub['LOTAREA'].mean() * 2), 
#                         ylim=(0, sub['FAIRMARKETLAND'].mean() * 2), 
#                         alpha=0.4, 
#                         figsize=(12, 10));
#    
#    return zip_chart





Run the following code cell to check that your function returns a chart as expected:

In [None]:
#plt.figure()
#zip_chart = zip_land(df, 15236)

### Charting Property Values with Seaborn

**Q8.** Refer to the `df` DataFrame. Create a new DataFrame called `sales` which contains only entries from `df` with a `SALEDESC` of `'VALID SALE'`.
- Use the `.copy()` method to ensure you have a distinct DataFrame in memory
- Call the new dataframe `sales`


See below code syntax for some guidance:
```python
sales['SALEDESC']=='VALID SALE'
```

In [None]:
#add your code below
#sales = df.copy()



**Q9.** Add a column to `sales` called `PITTSBURGH`, containing boolean values of `True` where `PROPERTYCITY` equals `PITTSBURGH` and `False` if not.

See below code syntax for some guidance:
```python
sales[new_column_name] = sales['PROPERTYCITY'] == 'PITTSBURGH'
```
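Assigning a comparison to a new column name stores one `True`/`False` value per row. A minimal sketch on made-up data (the name `toy_sales` and the city value `'OTHER CITY'` are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy_sales = pd.DataFrame({
    'PROPERTYCITY': ['PITTSBURGH', 'OTHER CITY', 'PITTSBURGH'],
})

# The comparison is evaluated row by row and stored as a boolean column
toy_sales['PITTSBURGH'] = toy_sales['PROPERTYCITY'] == 'PITTSBURGH'
print(toy_sales)
```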

In [None]:
#add your code below
#sales['PITTSBURGH'] =



**Q10.** Create a seaborn `.violinplot()` with the following properties:

`x` = `'PITTSBURGH'`  
`y` = `'FAIRMARKETTOTAL'`   
`data` = only entries from `sales` where `sales['BEDROOMS'] == 1`

Call the new variable `pitts_violin`

See below code syntax for some guidance:
```python
plt.figure()
pitts_violin = sns.violinplot(x=..., y=..., data=...);
```

In [None]:
#add your code below
#We create a new figure to make sure other figures in the notebook don't get modified
plt.figure()
#pitts_violin = 



**Q11.** Create a seaborn `.regplot()` with the following properties:

`x` = `'SALEPRICE'`    
`y` = `'FAIRMARKETTOTAL'`  
`data` = only entries from `sales` where `sales['GRADEDESC'] == 'EXCELLENT'`

Call the new variable `exc_reg`

See below code syntax for some guidance:
```python
plt.figure()
exc_reg = sns.regplot(x=..., y=..., data=...)
```

In [None]:
#add your code below
#We create a new figure to make sure other figures in the notebook don't get modified
plt.figure()
#exc_reg = 



**Q12.** Create a DataFrame called `bus` which contains only entries from `sales` where `CLASSDESC` `.isin(['COMMERCIAL', 'INDUSTRIAL', 'AGRICULTURAL'])`.

Please note you have been provided with the code for this question to carry out the necessary data manipulation work. Simply uncomment the lines of code and run the code cell to produce the desired results.

In [None]:
#add your code below
#bus = sales[sales['CLASSDESC'].isin(['COMMERCIAL', 'INDUSTRIAL', 'AGRICULTURAL'])]
#bus.head()



**Q13.** Create a DataFrame using the `.groupby()` method on the `bus` DataFrame with the following properties:

- Data grouped by `['CLASSDESC', 'PITTSBURGH']` where the values are of the `.mean()` of `'FAIRMARKETTOTAL'`
- Use `.reset_index()` so that a DataFrame is created
- Use `.sort_values(by='FAIRMARKETTOTAL')` to order it

Call the new dataframe `bus_value`

Please note you have been provided with the code for this question to carry out the necessary data manipulation work. Simply uncomment the lines of code and run the code cell to produce the desired results.

In [None]:
#add your code below
#bus_value = bus.groupby(['CLASSDESC', 'PITTSBURGH'])['FAIRMARKETTOTAL'].mean().reset_index().sort_values(by='FAIRMARKETTOTAL')
#bus_value



**Q14.** Create a seaborn `.barplot()` with the following properties:
- `x` = `'CLASSDESC'`
- `y` = `'FAIRMARKETTOTAL'`
- `hue` = `'PITTSBURGH'`
- `data` = `bus_value`

Call the new variable `bus_bar`

See below code syntax for some guidance:
```python
plt.figure()
bus_bar = sns.barplot(x=..., y=..., hue=..., data=...);
```

In [None]:
#add your code below
#We create a new figure to make sure other figures in the notebook don't get modified
plt.figure()
#bus_bar = 

