In [1]:
import pandas as pd
import numpy as np

## Data Preparation

Steps:
- read the first 200K rows of the original tab-delimited file, keeping only the **['COMMON NAME', 'COUNTRY', 'STATE', 'COUNTY', 'LATITUDE', 'LONGITUDE', 'OBSERVATION DATE', 'OBSERVATION COUNT']** columns
- rename the columns to short lowercase names
- keep only the birds observed in 'United States'
- replace 'X' with 1 in the **observ_count** column
- convert **observ_date** to datetime
- extract year and month from **observ_date** into their own columns
- derive a **season** column from month
- build a **county_state** column (no spaces) to merge on
- load the region Excel file
- strip leading whitespace from **State**
- drop 'county' from **CountyName**
- merge on county
- build a **counts** Series with the percentage of each bird
- map **counts** to a **total_rarity** column
- compute **regional_rarity** from counts split by region
- compute **seasonal_rarity** from counts split by region and season
- set **rarity_label** if any of the three measures comes out as rare
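
The derived-column steps in the list above can be sketched on a toy frame. This is only an illustration, not the notebook's actual implementation: the sample rows are made up, the month-to-season mapping (meteorological seasons) and the underscore separator in the merge key are assumptions, and `month_to_season` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for the cleaned US data (values are made up for illustration).
toy = pd.DataFrame({
    'name': ['Mallard', 'Mallard', 'Mallard', 'Snowy Owl'],
    'county': ['Cook', 'Ada', 'Cook', 'Cook'],
    'state': ['Illinois', 'Idaho', 'Illinois', 'Illinois'],
    'month': [1, 4, 7, 12],
})

# Assumption: meteorological seasons (Dec-Feb = winter, and so on).
def month_to_season(month):
    seasons = {12: 'winter', 1: 'winter', 2: 'winter',
               3: 'spring', 4: 'spring', 5: 'spring',
               6: 'summer', 7: 'summer', 8: 'summer'}
    return seasons.get(month, 'fall')

toy['season'] = toy['month'].map(month_to_season)

# Merge key without spaces, e.g. 'Cook_Illinois' (separator is an assumption).
toy['county_state'] = (toy['county'] + '_' + toy['state']).str.replace(' ', '', regex=False)

# Share of all observations per species, mapped back onto each row.
counts = toy['name'].value_counts(normalize=True)
toy['total_rarity'] = toy['name'].map(counts)

print(toy)
```

The regional and seasonal variants would follow the same pattern, with the shares computed within each group instead of over the whole frame.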

In [6]:
# Step 1: Read the first 200K rows of the tab-delimited file, keeping 8 columns
cols = ['COMMON NAME', 'COUNTRY', 'STATE', 'COUNTY', 'LATITUDE', 'LONGITUDE',
        'OBSERVATION DATE', 'OBSERVATION COUNT']
df = pd.read_csv(r'C:\Users\ajaco\Downloads\ebd_relJan-2020.txt',
                 sep='\t', nrows=200000, usecols=cols)

print(df.shape)

df.head()

(200000, 8)


Unnamed: 0,COMMON NAME,OBSERVATION COUNT,COUNTRY,STATE,COUNTY,LATITUDE,LONGITUDE,OBSERVATION DATE
0,Magnolia Warbler,2,United States,Illinois,Cook,41.775629,-87.583273,1995-08-27
1,White-rumped Sandpiper,4,Canada,Quebec,Manicouagan,49.21667,-68.15,1993-11-07
2,Common Scoter,1,Sweden,Hallands län [SE-13],,57.065084,12.243579,1998-02-21
3,Ring-billed Gull,15,Canada,Manitoba,South Interlake,50.193256,-97.137935,1985-04-14
4,Red-winged Blackbird,500,Canada,Manitoba,South Interlake,50.193256,-97.137935,1986-09-01


In [7]:
df.isnull().sum()

COMMON NAME              0
OBSERVATION COUNT        0
COUNTRY                  0
STATE                    0
COUNTY               29261
LATITUDE                 0
LONGITUDE                0
OBSERVATION DATE         0
dtype: int64
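
Only COUNTY has missing values. One way to check whether those gaps matter for a US-only analysis is to count the missing counties per country. A minimal sketch on made-up rows (the values below are not taken from the dataset, and `sample` is a hypothetical name):

```python
import pandas as pd

# Made-up rows that mimic the raw column names.
sample = pd.DataFrame({
    'COUNTRY': ['United States', 'Sweden', 'Canada', 'Sweden'],
    'COUNTY': ['Cook', None, 'Manicouagan', None],
})

# Count rows with a missing COUNTY, grouped by COUNTRY.
missing_by_country = sample['COUNTY'].isnull().groupby(sample['COUNTRY']).sum()
print(missing_by_country)
```

Running the same expression on `df` would show how many of the 29261 missing counties fall in each country.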

In [10]:
# Rename columns for ease of use
df.rename(columns={
    'COMMON NAME': 'name',
    'OBSERVATION COUNT': 'observ_count',
    'COUNTRY': 'country',
    'STATE': 'state',
    'COUNTY': 'county',
    'LATITUDE': 'latitude',
    'LONGITUDE': 'longitude',
    'OBSERVATION DATE': 'observ_date'
}, inplace=True)

In [11]:
df.head()

Unnamed: 0,name,observ_count,country,state,county,latitude,longitude,observ_date
0,Magnolia Warbler,2,United States,Illinois,Cook,41.775629,-87.583273,1995-08-27
1,White-rumped Sandpiper,4,Canada,Quebec,Manicouagan,49.21667,-68.15,1993-11-07
2,Common Scoter,1,Sweden,Hallands län [SE-13],,57.065084,12.243579,1998-02-21
3,Ring-billed Gull,15,Canada,Manitoba,South Interlake,50.193256,-97.137935,1985-04-14
4,Red-winged Blackbird,500,Canada,Manitoba,South Interlake,50.193256,-97.137935,1986-09-01


In [12]:
# Filter for just US birds
us_birds = df.query("country == 'United States'")

print(us_birds.shape)
us_birds.head()

(105294, 8)


Unnamed: 0,name,observ_count,country,state,county,latitude,longitude,observ_date
0,Magnolia Warbler,2,United States,Illinois,Cook,41.775629,-87.583273,1995-08-27
6,Greater Yellowlegs,X,United States,Texas,Aransas,28.240392,-96.818819,1986-04-06
12,White-crowned Sparrow,X,United States,Arizona,Cochise,31.898164,-109.115932,1998-11-27
13,Green-winged Teal,11,United States,Idaho,Ada,43.609793,-116.206427,1982-12-18
14,Yellow-rumped Warbler,5,United States,Idaho,Ada,43.609793,-116.206427,1982-12-18
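
In eBird data an 'X' marks a species reported as present without a count, so a column that contains it is typically loaded as text rather than numbers. Replacing 'X' alone does not make the column numeric. A minimal sketch, on made-up values, of an optional follow-up conversion with `pd.to_numeric` (this is an extra step for illustration, not something the notebook does here):

```python
import pandas as pd

# Made-up counts in the same shape as the raw column: digits as text plus 'X'.
raw = pd.Series(['2', 'X', 'X', '11', '5'], name='observ_count')

# Treat 'X' as a single bird, then convert the strings to integers.
clean = pd.to_numeric(raw.replace('X', '1'))
print(clean.dtype, clean.tolist())
```

With a numeric column, sums and comparisons on the counts behave as expected.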


In [13]:
# Replace 'X' in 'observ_count' with 1.
# Take an explicit copy of the filtered frame first, so that assigning columns
# does not raise SettingWithCopyWarning.
us_birds = us_birds.copy()
us_birds['observ_count'] = us_birds['observ_count'].apply(lambda x: 1 if x == 'X' else x)

us_birds.head()

Unnamed: 0,name,observ_count,country,state,county,latitude,longitude,observ_date
0,Magnolia Warbler,2,United States,Illinois,Cook,41.775629,-87.583273,1995-08-27
6,Greater Yellowlegs,1,United States,Texas,Aransas,28.240392,-96.818819,1986-04-06
12,White-crowned Sparrow,1,United States,Arizona,Cochise,31.898164,-109.115932,1998-11-27
13,Green-winged Teal,11,United States,Idaho,Ada,43.609793,-116.206427,1982-12-18
14,Yellow-rumped Warbler,5,United States,Idaho,Ada,43.609793,-116.206427,1982-12-18


In [14]:
# Convert the observation date to datetime, then extract year and month
us_birds['observ_date'] = pd.to_datetime(us_birds['observ_date'])
us_birds['year'] = us_birds['observ_date'].dt.year
us_birds['month'] = us_birds['observ_date'].dt.month
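
The date step can be sketched on made-up rows. The example passes an explicit `format`, on the assumption that every date looks like the `YYYY-MM-DD` values shown in the tables above, and it uses `assign` so that the new columns land on a fresh DataFrame rather than being written into a slice of another one. The name `obs` is hypothetical.

```python
import pandas as pd

# Made-up observation dates in the same layout as the tables above.
obs = pd.DataFrame({'observ_date': ['1995-08-27', '1986-04-06', '1982-12-18']})

# Parse ISO-style dates; an explicit format fails loudly on unexpected values.
dates = pd.to_datetime(obs['observ_date'], format='%Y-%m-%d')

# assign returns a new DataFrame with the parsed dates plus year and month.
obs = obs.assign(observ_date=dates, year=dates.dt.year, month=dates.dt.month)
print(obs.dtypes)
```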
