# DataJam v.2 
By: Minchan Kim

Updates since v.1:
* imported geopandas, shapely.geometry, and matplotlib
* loaded a California geospatial map (.shp file) that breaks the state down into administrative regions
* converted each bluebird data point into a point geometry that geopandas can use, and assigned it a coordinate reference system with set_crs
* spatially joined the updated bluebird data set with the California .shp file
* created a basic choropleth from the spatial join using geopandas
* created pie and bar charts showing the distribution of bluebird sightings across counties

## Import Statements:

In [None]:
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point, Polygon
%matplotlib inline
import matplotlib.pyplot as plt

## Creating geospatial data using bluebird csv file:

In [None]:
bluebirds = pd.read_csv("files/bluebird0209.csv")
bluebirds = bluebirds[bluebirds.get('Format') == 'Photo']

In [None]:
columns_with_NaN = bluebirds.columns[bluebirds.isnull().any()]
bluebirds_cleaned = bluebirds.drop(columns = columns_with_NaN)
bluebirds_cleaned = bluebirds_cleaned.drop(columns = ['Format', 'Common Name', 'Scientific Name', 
                                                        'Recordist', 'Playback', 'Parent Species', 'Taxon Category'])

In [None]:
cali_bluebirds = bluebirds_cleaned[bluebirds_cleaned.get('State') == 'California']
cali_bluebirds

In [None]:
bluebird_points = cali_bluebirds.apply(lambda row: Point(row.Longitude, row.Latitude), axis = 1)
bluebird_points

In [None]:
cali_bluebird_points = gpd.GeoDataFrame(cali_bluebirds, geometry = bluebird_points)
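# Declare the coordinate reference system as EPSG:4269 (NAD83, longitude/latitude in degrees).
# set_crs only labels the coordinates; it does not reproject them.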
cali_bluebird_points = cali_bluebird_points.set_crs(epsg=4269)
cali_bluebird_points.head()

In [None]:
cali_bluebird_points.plot(figsize = (10, 10), color = 'purple', markersize = 4, alpha = .3).axis('off')

In [None]:
cali_shape = gpd.read_file("files/cali/tl_2019_06_cousub.shp")
cali_shape

## Experimenting with the California .shp file:

In [None]:
cali_shape.plot()

In [None]:
cali_shape.crs

In [None]:
# Parts of coloring shapes:
# 1. fill - inside part
# 2. stroke/line/edge - outline around our shape

In [None]:
# set_crs returns a new GeoDataFrame unless inplace=True, so keep the result.
cali_shape = cali_shape.set_crs(epsg=4269)
cali_resize = cali_shape.plot(figsize = (10,10), color = 'grey', edgecolor = 'white')

## Creating the choropleth:

In [None]:
joined = gpd.sjoin(cali_shape, cali_bluebird_points, how = 'inner', predicate = 'contains')
joined
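
To make the spatial join above more concrete, here is a minimal, dependency-free sketch of the idea behind a `contains` join: test each point against each region and count the matches. This is not how geopandas implements `sjoin` (it uses a spatial index and proper geometry predicates); the ray-casting helper, the toy square regions, and the sample coordinates are all hypothetical and only illustrate the logic. Points that fall exactly on a boundary are not handled carefully here.

```python
from collections import Counter

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside polygon, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical toy regions (unit squares), not real California boundaries.
regions = {
    "West": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "East": [(1, 0), (2, 0), (2, 1), (1, 1)],
}

# Hypothetical sightings as (longitude, latitude) pairs.
sightings = [(0.2, 0.5), (0.8, 0.1), (1.5, 0.5), (3.0, 3.0)]

# Count how many sightings each region contains.
counts = Counter()
for lon, lat in sightings:
    for name, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            counts[name] += 1
            break

print(dict(counts))
```

The last sighting falls outside both regions and is dropped, which mirrors what `how = 'inner'` does in the join above.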

In [None]:
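# Index by GEOID rather than NAME: subdivision names are not guaranteed to be unique across counties.
administrative_regions = cali_shape.set_index('GEOID')[['NAME', 'geometry']]
administrative_regions

In [None]:
# Count sightings per region; regions with no sightings get 0 instead of NaN.
administrative_regions['count'] = joined['GEOID'].value_counts()
administrative_regions['count'] = administrative_regions['count'].fillna(0)
administrative_regions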

In [None]:
administrative_regions.plot(column = 'count', cmap = 'Oranges', legend = True, figsize = (20, 20), edgecolor = 'grey')

## Creating pie and bar charts:

In [None]:
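# Restrict to California first: county names such as 'Orange' also occur in other states.
counties = bluebirds[bluebirds.get('State') == 'California'].groupby('County').count()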
counties

In [None]:
socal_counties = ['San Luis Obispo', 'Kern', 'Santa Barbara',
                  'Ventura', 'Los Angeles', 'San Bernardino', 'Orange',
                  'Riverside', 'San Diego', 'Imperial']
socal = counties.loc[socal_counties]
socal

In [None]:
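# After .count(), each column holds the number of non-null rows per county; 'Date' is used as the sightings total.
socal.plot.pie(y = 'Date')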

In [None]:
socal.plot.bar(y = 'Date')

## Conclusions and Future Insights:
### Conclusions:
* Coastal regions appear to have more bluebird sightings than inland regions.
* Although outside the scope of this study, Santa Clara shows an unusually high number of sightings compared with the rest of Northern California.
* There is most likely a relationship between the number of bird watchers in a region and the number of sightings recorded there.
* More affluent regions seem to have more sightings, though this notebook does not include income data to test that.
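
One way to probe the suspected link between bird watchers and sightings is to divide each county's sightings by its number of distinct observers. The sketch below assumes the `Recordist` column of the raw data identifies the observer; the toy rows are hypothetical and stand in for the real file.

```python
import pandas as pd

# Hypothetical toy records; the real data set has many more rows and columns.
toy = pd.DataFrame({
    "County": ["Orange", "Orange", "Orange", "Kern", "Kern", "San Diego"],
    "Recordist": ["A", "A", "B", "C", "C", "D"],
})

# Per county: total sightings and number of distinct observers.
per_county = toy.groupby("County").agg(
    sightings=("Recordist", "count"),
    observers=("Recordist", "nunique"),
)

# Sightings per observer partly controls for how many people are looking.
per_county["sightings_per_observer"] = per_county["sightings"] / per_county["observers"]
print(per_county)
```

If the per-observer figures are much flatter across counties than the raw counts, that would suggest observer density explains a good share of the pattern.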

### Further Insights:
* A multi-point geometric object in GeoPandas may be more useful than a choropleth.
* The map needs to be restricted to Southern California for the purposes of this study, rather than covering all of California.
* A better analysis needs more than 10,000 rows of data (work in progress).
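
A possible starting point for restricting the map to Southern California is to filter the regions by county FIPS code. This sketch assumes the shapefile's attribute table has a `COUNTYFP` column holding three-digit county codes as strings, as TIGER/Line county subdivision files normally do. The toy table and its region names are hypothetical stand-ins for `cali_shape` with the geometry omitted.

```python
import pandas as pd

# County FIPS codes for the ten counties listed in socal_counties.
SOCAL_COUNTYFP = {
    "San Luis Obispo": "079",
    "Kern": "029",
    "Santa Barbara": "083",
    "Ventura": "111",
    "Los Angeles": "037",
    "San Bernardino": "071",
    "Orange": "059",
    "Riverside": "065",
    "San Diego": "073",
    "Imperial": "025",
}

# Hypothetical stand-in for the shapefile's attribute table (no geometry column).
toy_regions = pd.DataFrame({
    "NAME": ["Region A", "Region B", "Region C"],
    "COUNTYFP": ["059", "085", "029"],  # Orange, Santa Clara, Kern
})

# Keep only regions whose county code is in the Southern California set.
socal_codes = set(SOCAL_COUNTYFP.values())
socal_regions = toy_regions[toy_regions["COUNTYFP"].isin(socal_codes)]
print(socal_regions)
```

The same boolean filter should carry over to a GeoDataFrame, since it supports the usual pandas row selection.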