## Creating a SQLite Database from our CSV Files

We use pandas and sqlite3 to create our .db file.

In [1]:
import pandas as pd
import sqlite3

We use pandas to load the CSV into a dataframe and display the first few rows.

In [2]:
bite_df = pd.read_csv('assets/Health_AnimalBites.csv')
print(bite_df.head())

             bite_date SpeciesIDDesc BreedIDDesc GenderIDDesc       color  \
0  1985-05-05 00:00:00           DOG         NaN       FEMALE  LIG. BROWN   
1  1986-02-12 00:00:00           DOG         NaN      UNKNOWN   BRO & BLA   
2  1987-05-07 00:00:00           DOG         NaN      UNKNOWN         NaN   
3  1988-10-02 00:00:00           DOG         NaN         MALE   BLA & BRO   
4  1989-08-29 00:00:00           DOG         NaN       FEMALE     BLK-WHT   

   vaccination_yrs     vaccination_date victim_zip AdvIssuedYNDesc  \
0              1.0  1985-06-20 00:00:00      40229              NO   
1              NaN                  NaN      40218              NO   
2              NaN                  NaN      40219              NO   
3              NaN                  NaN        NaN              NO   
4              NaN                  NaN        NaN              NO   

  WhereBittenIDDesc      quarantine_date DispositionIDDesc head_sent_date  \
0              BODY  1985-05-05 00:00:0

Now sqlite3 creates the database file and our first table, named Bites.

In [3]:
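with sqlite3.connect('lunacy.db') as conn:
    # if_exists='replace' lets this cell be rerun; the default ('fail') raises
    # "ValueError: Table 'Bites' already exists." once lunacy.db has been created
    bite_df.to_sql('Bites', conn, if_exists='replace')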

Now we do the same for our CSV of moon phase data.

In [4]:
moon_df = pd.read_csv('assets/moon_illumination_1800-2100.csv')
print(moon_df.head())

   Unnamed: 0                    date  illum_pct           phase
0           0  1800-01-01 12:00:00.00      36.00   first quarter
1           1  1800-01-02 12:00:00.00      45.58   first quarter
2           2  1800-01-03 12:00:00.00      55.14   first quarter
3           3  1800-01-04 12:00:00.00      64.41   first quarter
4           4  1800-01-05 12:00:00.00      73.12  waxing gibbous


This will become the table called Moon.

In [5]:
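with sqlite3.connect('lunacy.db') as conn:
    # if_exists='replace' lets this cell be rerun; the default ('fail') raises
    # "ValueError: Table 'Moon' already exists." once the table has been written
    moon_df.to_sql('Moon', conn, if_exists='replace')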
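
To confirm which tables a database file already holds (useful when a rerun reports that a table already exists), one option is to query SQLite's built-in `sqlite_master` catalog. The sketch below is only an illustration: it uses an in-memory database and a one-row stand-in frame instead of `lunacy.db`, and the variable names are assumptions.

```python
import sqlite3
import pandas as pd

# In-memory database so the sketch does not touch lunacy.db
conn = sqlite3.connect(":memory:")

# One-row stand-in for moon_df (values copied from the head() output above)
sample = pd.DataFrame({
    "date": ["1800-01-01 12:00:00.00"],
    "illum_pct": [36.00],
    "phase": ["first quarter"],
})
sample.to_sql("Moon", conn, index=False)

# sqlite_master lists every table, index and view in the database
tables = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;", conn
)
print(tables["name"].tolist())

# Row count of the table we just wrote
row_count = conn.execute("SELECT COUNT(*) FROM Moon;").fetchone()[0]
print(row_count)

conn.close()
```

Running the same two queries against `lunacy.db` would show whether Bites and Moon are already present before writing them again.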

## Query our SQL data

First, our animal bite data.

In [6]:
con = sqlite3.connect("lunacy.db")

bite_df = pd.read_sql("""SELECT 
                            bite_date, 
                            SpeciesIDDesc, 
                            BreedIDDesc, 
                            GenderIDDesc, 
                            color, 
                            victim_zip, 
                            WhereBittenIDDesc, 
                            ResultsIDDesc 
                        FROM 
                            bites b 
                        WHERE 
                            b.bite_date 
                        BETWEEN 
                            '2009-10-29' and '2017-09-08' 
                        ORDER BY 
                            bite_date;""", con)

print(bite_df.head())

             bite_date SpeciesIDDesc BreedIDDesc GenderIDDesc    color  \
0  2009-10-29 00:00:00           CAT        None       FEMALE     GRAY   
1  2009-12-02 00:00:00           DOG        None         MALE  TAN-BRN   
2  2009-12-11 00:00:00           DOG        None         MALE  BLK-BRN   
3  2009-12-21 00:00:00           DOG        None      UNKNOWN    BLACK   
4  2009-12-24 00:00:00           DOG        None         MALE  BRN-WHT   

  victim_zip WhereBittenIDDesc ResultsIDDesc  
0      40206              BODY       UNKNOWN  
1      40291              BODY       UNKNOWN  
2      40272              BODY       UNKNOWN  
3      40218              HEAD       UNKNOWN  
4      40165              HEAD       UNKNOWN  
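
One detail of the date filter is worth checking. The dates appear to be stored as text (the columns load as object), so `BETWEEN` compares strings, and a value such as '2017-09-08 00:00:00' sorts after the bound '2017-09-08'. The sketch below, on a three-row stand-in table with made-up rows, shows that effect and one way to make the upper bound inclusive with SQLite's `date()` function. It also passes the bounds as query parameters, which is an assumption about style rather than something the notebook does.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Hypothetical rows, stored as text like the real bite_date column
pd.DataFrame({
    "bite_date": [
        "2009-10-28 00:00:00",
        "2009-10-29 00:00:00",
        "2017-09-08 00:00:00",
    ],
    "SpeciesIDDesc": ["DOG", "CAT", "DOG"],
}).to_sql("Bites", conn, index=False)

bounds = ("2009-10-29", "2017-09-08")

# Plain text comparison: the last day falls outside the range
text_query = """
    SELECT bite_date, SpeciesIDDesc
    FROM Bites
    WHERE bite_date BETWEEN ? AND ?
    ORDER BY bite_date;
"""
in_range = pd.read_sql(text_query, conn, params=bounds)

# date() strips the time part, so both bounds are inclusive
date_query = """
    SELECT bite_date, SpeciesIDDesc
    FROM Bites
    WHERE date(bite_date) BETWEEN ? AND ?
    ORDER BY bite_date;
"""
inclusive = pd.read_sql(date_query, conn, params=bounds)

print(len(in_range), len(inclusive))
conn.close()
```

In this notebook the difference only affects the final day of the range.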


Now, our lunar data.

In [7]:
moon_df = pd.read_sql("""SELECT 
                            date, 
                            illum_pct, 
                            phase 
                        FROM 
                            moon m 
                        WHERE 
                            m.date 
                        BETWEEN 
                            '2009-10-29' and '2017-09-08' 
                        ORDER BY 
                            date;""", con)

print(moon_df.head())

                     date  illum_pct           phase
0  2009-10-29 12:00:00.00      80.90  waxing gibbous
1  2009-10-30 12:00:00.00      88.09  waxing gibbous
2  2009-10-31 12:00:00.00      93.90  waxing gibbous
3  2009-11-01 12:00:00.00      97.94            full
4  2009-11-02 12:00:00.00      99.88            full


We first see what we're working with.

In [8]:
print(bite_df.info())
print(moon_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8631 entries, 0 to 8630
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   bite_date          8631 non-null   object
 1   SpeciesIDDesc      8530 non-null   object
 2   BreedIDDesc        3696 non-null   object
 3   GenderIDDesc       6355 non-null   object
 4   color              6307 non-null   object
 5   victim_zip         6879 non-null   object
 6   WhereBittenIDDesc  8303 non-null   object
 7   ResultsIDDesc      1328 non-null   object
dtypes: object(8)
memory usage: 539.6+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2871 entries, 0 to 2870
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       2871 non-null   object 
 1   illum_pct  2871 non-null   float64
 2   phase      2871 non-null   object 
dtypes: float64(1), object(2)
memory usage: 67.4+ KB
None


First, we'll convert our two date columns to datetime64. We can then drop the times, since the bite timestamps are all midnight (00:00:00) and the moon timestamps are all noon (12:00:00).

In [9]:
bite_df['bite_date'] = pd.to_datetime(bite_df['bite_date'], errors='coerce')
print(bite_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8631 entries, 0 to 8630
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   bite_date          8631 non-null   datetime64[ns]
 1   SpeciesIDDesc      8530 non-null   object        
 2   BreedIDDesc        3696 non-null   object        
 3   GenderIDDesc       6355 non-null   object        
 4   color              6307 non-null   object        
 5   victim_zip         6879 non-null   object        
 6   WhereBittenIDDesc  8303 non-null   object        
 7   ResultsIDDesc      1328 non-null   object        
dtypes: datetime64[ns](1), object(7)
memory usage: 539.6+ KB
None
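
As a side note on `errors='coerce'`: it turns any value that cannot be parsed into NaT instead of raising an error. In the output above all 8631 rows remain non-null, so nothing was coerced here. A small sketch with made-up values shows the behavior:

```python
import pandas as pd

# Hypothetical raw values: one valid timestamp, one bad string, one missing
raw = pd.Series(["2013-09-12 00:00:00", "not a date", None])

# Unparseable and missing entries both become NaT
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print("coerced or missing:", parsed.isna().sum())
```

Comparing the null count before and after parsing is a quick way to see how many values were silently coerced.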


In [10]:
moon_df['date'] = pd.to_datetime(moon_df['date'])
print(moon_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2871 entries, 0 to 2870
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       2871 non-null   datetime64[ns]
 1   illum_pct  2871 non-null   float64       
 2   phase      2871 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 67.4+ KB
None


We will need to join our data on the date. Currently the bite dataframe carries an empty time component (00:00:00) and the lunar dataframe records every time as exactly noon (12:00:00.00), so the two columns would never match. We reduce the lunar column to just the date, which turns it into an object column, and then convert it back to datetime64.

In [11]:
moon_df['date'] = moon_df['date'].dt.date
print(moon_df.info())
moon_df['date'] = pd.to_datetime(moon_df['date'])
print(moon_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2871 entries, 0 to 2870
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       2871 non-null   object 
 1   illum_pct  2871 non-null   float64
 2   phase      2871 non-null   object 
dtypes: float64(1), object(2)
memory usage: 67.4+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2871 entries, 0 to 2870
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       2871 non-null   datetime64[ns]
 1   illum_pct  2871 non-null   float64       
 2   phase      2871 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 67.4+ KB
None
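
The round trip through `.dt.date` works, but pandas also offers `Series.dt.normalize()`, which resets the time of day to midnight and keeps the datetime dtype in one step. A minimal sketch, where the helper name `to_midnight` is an assumption and the sample strings follow the two timestamp formats shown earlier:

```python
import pandas as pd

def to_midnight(series):
    """Parse timestamps and reset the time of day to 00:00:00."""
    return pd.to_datetime(series).dt.normalize()

# Moon timestamps are at noon, bite timestamps at midnight
moon_dates = pd.Series(["2009-10-29 12:00:00.00", "2009-10-30 12:00:00.00"])
bite_dates = pd.Series(["2009-10-29 00:00:00", "2009-10-30 00:00:00"])

normalized_moon = to_midnight(moon_dates)
normalized_bites = to_midnight(bite_dates)

# After normalizing, the two columns line up and can be used as join keys
print((normalized_moon == normalized_bites).all())
```

A helper like this could be reused wherever the same conversion is repeated.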


We now combine our two dataframes on the date.

In [12]:
df = pd.merge(bite_df, moon_df, left_on='bite_date', right_on='date')
print(df.head())

   bite_date SpeciesIDDesc BreedIDDesc GenderIDDesc    color victim_zip  \
0 2009-10-29           CAT        None       FEMALE     GRAY      40206   
1 2009-12-02           DOG        None         MALE  TAN-BRN      40291   
2 2009-12-11           DOG        None         MALE  BLK-BRN      40272   
3 2009-12-21           DOG        None      UNKNOWN    BLACK      40218   
4 2009-12-24           DOG        None         MALE  BRN-WHT      40165   

  WhereBittenIDDesc ResultsIDDesc       date  illum_pct            phase  
0              BODY       UNKNOWN 2009-10-29      80.90   waxing gibbous  
1              BODY       UNKNOWN 2009-12-02      99.96             full  
2              BODY       UNKNOWN 2009-12-11      24.10  waning crescent  
3              HEAD       UNKNOWN 2009-12-21      21.11  waxing crescent  
4              HEAD       UNKNOWN 2009-12-24      47.87    first quarter  
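
The merge above uses pandas' default inner join, so dates without a reported bite do not appear in the result. The sketch below, on made-up rows, spells that out with `how="inner"` and adds `validate="many_to_one"`, which makes pandas raise an error if the moon table ever holds more than one row per date.

```python
import pandas as pd

# Hypothetical bites: two on the same day, one on a later day
bites = pd.DataFrame({
    "bite_date": pd.to_datetime(["2009-10-29", "2009-10-29", "2009-12-02"]),
    "SpeciesIDDesc": ["CAT", "DOG", "DOG"],
})

# One moon row per day, including a day with no bites
moon = pd.DataFrame({
    "date": pd.to_datetime(["2009-10-29", "2009-10-30", "2009-12-02"]),
    "phase": ["waxing gibbous", "waxing gibbous", "full"],
})

merged = pd.merge(
    bites,
    moon,
    left_on="bite_date",
    right_on="date",
    how="inner",              # keep only dates present in both frames
    validate="many_to_one",   # many bites may share one moon row, not vice versa
)

print(len(merged))
print(merged["date"].dt.strftime("%Y-%m-%d").unique())
```

The day without bites (2009-10-30) is missing from the result, which matters later when days with no bites need to be counted.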


To keep a copy on disk rather than only in memory, I will now save our merged dataframe as a CSV file.

In [13]:
df.to_csv('assets/merged_df.csv')

Now that our data is all in one dataframe, it is time for a bit of exploratory data analysis.

In [14]:
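# datetime_is_numeric=True applies to pandas 1.x; pandas 2.0 removed the argument
# (datetime columns are always summarized numerically there), so omit it on 2.x
print(df.describe(include='all', datetime_is_numeric=True))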

                            bite_date SpeciesIDDesc BreedIDDesc GenderIDDesc  \
count                            8631          8530        3696         6355   
unique                            NaN             9         101            3   
top                               NaN           DOG    PIT BULL         MALE   
freq                              NaN          6883        1078         3769   
mean    2013-10-14 19:02:31.407716352           NaN         NaN          NaN   
min               2009-10-29 00:00:00           NaN         NaN          NaN   
25%               2011-11-14 00:00:00           NaN         NaN          NaN   
50%               2013-09-12 00:00:00           NaN         NaN          NaN   
75%               2015-08-18 12:00:00           NaN         NaN          NaN   
max               2017-09-07 00:00:00           NaN         NaN          NaN   
std                               NaN           NaN         NaN          NaN   

        color victim_zip WhereBittenIDD

In [15]:
# What kinds of dogs are out there biting the citizens of Louisville?
breed_total = df["BreedIDDesc"].value_counts()
print(breed_total)

# What types of animals appear on this list?
unique_species = df['SpeciesIDDesc'].unique()
print(unique_species)

# How many bites from these different species?
species_total = df["SpeciesIDDesc"].value_counts()
print(species_total)

PIT BULL           1078
GERM SHEPHERD       323
LABRADOR RETRIV     248
BOXER               181
CHICHAUHUA          164
                   ... 
RED HEELER            1
BRIARD                1
CHOCOLATE LAB.        1
FOX TERRIER MIX       1
IRISH WOLFHOUND       1
Name: BreedIDDesc, Length: 101, dtype: int64
['CAT' 'DOG' 'BAT' 'RACCOON' 'OTHER' None 'HORSE' 'RABBIT' 'SKUNK'
 'FERRET']
DOG        6883
CAT        1529
BAT          76
RACCOON      21
OTHER         8
HORSE         5
FERRET        4
RABBIT        3
SKUNK         1
Name: SpeciesIDDesc, dtype: int64


+ Fewer horse bites than one might assume for Kentucky.
+ With seventy-six reported bat bites, I now question some of what documentarian David Attenborough has said about how bats behave.
+ I also wonder if our one unfortunate skunk-bite victim knows how unique their situation really was.

In [16]:
# How many total bites occurred during each moon phase?
phases_total = df["phase"].value_counts()
print(phases_total)

waning gibbous     1352
waning crescent    1300
waxing gibbous     1295
waxing crescent    1279
last quarter        883
first quarter       863
full                832
new                 827
Name: phase, dtype: int64


Full moons appear to be on the low end of total reported bites, although these are raw totals and the phases do not necessarily cover the same number of days.
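
Raw totals per phase are only comparable if every phase covers the same number of days in the moon table, which has not been checked here. One way to account for that is to divide the bites per phase by the days per phase. The sketch below uses a few made-up rows to show the calculation; the frame and variable names are assumptions.

```python
import pandas as pd

# Hypothetical calendar: one "full" day and three "waning gibbous" days
moon = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    "phase": ["full", "waning gibbous", "waning gibbous", "waning gibbous"],
})

# Hypothetical bites: two on the full-moon day, one on each other day
bites = pd.DataFrame({
    "bite_date": pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]
    ),
})

merged = bites.merge(moon, left_on="bite_date", right_on="date")

bites_per_phase = merged["phase"].value_counts()  # total bites in each phase
days_per_phase = moon["phase"].value_counts()     # calendar days in each phase

# Average bites per day, aligned on the phase label
rate = (bites_per_phase / days_per_phase).sort_values(ascending=False)
print(rate)
```

In this toy data the phase with the fewest total bites has the highest rate per day, which is why the totals alone can mislead.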

In [17]:
# df.corr()

To find any correlation between moon phases and bites, I believe we'll also need to include dates on which no bites were reported. For this we will be better off making another call to tailor a separate dataframe to our parameters. We'll start with ALL moon phases and dates, then left join the bite data onto them.

In [22]:
# SQL request for moon data for all dates in this timeframe
moon_corr_df = pd.read_sql("""SELECT 
                            date, 
                            illum_pct, 
                            phase 
                        FROM 
                            moon m 
                        WHERE 
                            m.date 
                        BETWEEN 
                            '2009-10-29' and '2017-09-08' 
                        ORDER BY 
                            date;""", con)

print(moon_corr_df.head())

# DEBUG make a function out of this?
# again convert to datetime, split on date, convert to datetime
moon_corr_df['date'] = pd.to_datetime(moon_corr_df['date'])
print("1!!")
print(moon_corr_df.info())
moon_corr_df['date'] = moon_corr_df['date'].dt.date
print("2!!")
print(moon_corr_df.info())
moon_corr_df['date'] = pd.to_datetime(moon_corr_df['date'])
print("3!!")
print(moon_corr_df.info())

                     date  illum_pct           phase
0  2009-10-29 12:00:00.00      80.90  waxing gibbous
1  2009-10-30 12:00:00.00      88.09  waxing gibbous
2  2009-10-31 12:00:00.00      93.90  waxing gibbous
3  2009-11-01 12:00:00.00      97.94            full
4  2009-11-02 12:00:00.00      99.88            full
1!!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2871 entries, 0 to 2870
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   date       2871 non-null   datetime64[ns]
 1   illum_pct  2871 non-null   float64       
 2   phase      2871 non-null   object        
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 67.4+ KB
None
2!!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2871 entries, 0 to 2870
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   date       2871 non-null   object 
 1   illum_pct  2871 non-null 

In [42]:
corr_df = pd.merge(moon_corr_df, bite_df, how='left', left_on='date', right_on='bite_date')
print(corr_df.head())

        date  illum_pct           phase  bite_date SpeciesIDDesc BreedIDDesc  \
0 2009-10-29      80.90  waxing gibbous 2009-10-29           CAT        None   
1 2009-10-30      88.09  waxing gibbous        NaT           NaN         NaN   
2 2009-10-31      93.90  waxing gibbous        NaT           NaN         NaN   
3 2009-11-01      97.94            full        NaT           NaN         NaN   
4 2009-11-02      99.88            full        NaT           NaN         NaN   

  GenderIDDesc color victim_zip WhereBittenIDDesc ResultsIDDesc  
0       FEMALE  GRAY      40206              BODY       UNKNOWN  
1          NaN   NaN        NaN               NaN           NaN  
2          NaN   NaN        NaN               NaN           NaN  
3          NaN   NaN        NaN               NaN           NaN  
4          NaN   NaN        NaN               NaN           NaN  
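
A left join like the one above keeps one row per bite, so a day with several bites appears several times while a day without bites appears once. If one row per day is wanted instead, a possible approach is to count bites per day first and then join the counts onto the moon dates. The sketch below uses the first three moon rows from the output above with made-up bite dates; the column names `bite_count` and `any_bite` are assumptions.

```python
import pandas as pd

moon_days = pd.DataFrame({
    "date": pd.to_datetime(["2009-10-29", "2009-10-30", "2009-10-31"]),
    "illum_pct": [80.90, 88.09, 93.90],
})

# Hypothetical bites: one on the 29th, none on the 30th, two on the 31st
bites = pd.DataFrame({
    "bite_date": pd.to_datetime(["2009-10-29", "2009-10-31", "2009-10-31"]),
})

# Number of bites per calendar day
daily_counts = bites.groupby("bite_date").size().rename("bite_count").reset_index()

# Every moon day is kept; days without bites get a count of 0
daily = moon_days.merge(daily_counts, how="left", left_on="date", right_on="bite_date")
daily["bite_count"] = daily["bite_count"].fillna(0).astype(int)
daily["any_bite"] = (daily["bite_count"] > 0).astype(int)
daily = daily.drop(columns="bite_date")

print(daily)
```

With this layout each day carries equal weight, and either the count or the 0/1 flag can be compared against the moon columns.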


In [43]:
# Flag each row: 1 if the day has a reported bite, 0 if it does not
# (bite_date is NaT on days where the left join found no bite)
corr_df['bite_corr_data'] = corr_df['bite_date'].notna().astype(int)

print(corr_df)

           date  illum_pct           phase  bite_date SpeciesIDDesc  \
0    2009-10-29      80.90  waxing gibbous 2009-10-29           CAT   
1    2009-10-30      88.09  waxing gibbous        NaT           NaN   
2    2009-10-31      93.90  waxing gibbous        NaT           NaN   
3    2009-11-01      97.94            full        NaT           NaN   
4    2009-11-02      99.88            full        NaT           NaN   
...         ...        ...             ...        ...           ...   
8850 2017-09-06      99.95            full 2017-09-06           DOG   
8851 2017-09-07      98.29            full 2017-09-07           DOG   
8852 2017-09-07      98.29            full 2017-09-07           DOG   
8853 2017-09-07      98.29            full 2017-09-07           DOG   
8854 2017-09-07      98.29            full 2017-09-07           DOG   

          BreedIDDesc GenderIDDesc    color victim_zip WhereBittenIDDesc  \
0                None       FEMALE     GRAY      40206              BOD

In [45]:


# calculate the correlation between the two columns
correlation = corr_df['phase'].astype('category').cat.codes.corr(corr_df['bite_corr_data'])

# print the correlation coefficient
print('Correlation coefficient:', correlation)


Correlation coefficient: -0.012269673917378775


A correlation coefficient of about -0.012 is very close to zero, which suggests little to no linear relationship between the moon phase codes and whether a bite was reported. Two caveats apply. First, the category codes number the phases in alphabetical order (first quarter, full, last quarter, and so on), not in lunar order, so the coefficient does not describe a trend across the lunar cycle. Second, the left join keeps one row per bite, so days with several bites are counted several times.

Even without a linear relationship, the two could still be related in a non-linear way. For example, bites could peak during particular phases without rising or falling steadily across the codes.

In summary, this coefficient gives no evidence that moon phase predicts reported bites, and other factors are likely more important in determining when bites occur.
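
Because phase is a label rather than a quantity, one alternative is to correlate the numeric illum_pct column with a per-day bite count, and to compare average counts by phase separately. The sketch below shows the mechanics on made-up numbers; the `daily` frame and its `bite_count` column are assumptions, not something the notebook builds.

```python
import pandas as pd

# Hypothetical one-row-per-day table
daily = pd.DataFrame({
    "illum_pct": [10.0, 30.0, 50.0, 70.0, 90.0, 99.0],
    "phase": [
        "new", "waxing crescent", "first quarter",
        "waxing gibbous", "waxing gibbous", "full",
    ],
    "bite_count": [3, 2, 4, 3, 2, 3],
})

# Pearson correlation between illumination and bites per day
pearson = daily["illum_pct"].corr(daily["bite_count"])
print("Pearson r:", pearson)

# Average bites per day within each phase, no numeric coding of phase needed
mean_by_phase = daily.groupby("phase")["bite_count"].mean()
print(mean_by_phase)
```

The per-phase means can reveal a peak in one phase even when the overall coefficient stays near zero.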

In [49]:
# group by moon_phase and calculate correlation coefficient
#corr_grouped = corr_df.groupby('phase')
#corr = corr_grouped.corr().iloc[0::2,-1]

#print(corr)




# convert "phase" column to categorical data type
# corr_df["phase"] = pd.Categorical(corr_df["phase"], categories=['first quarter', 'waxing gibbous', 'full', 
#                                                                 'waning gibbous', 'last quarter', 
#                                                                 'waning crescent', 'new', 'waxing crescent'])

# # group the dataframe by "phase" and calculate correlation coefficients for each group
# corr_grouped = corr_df.groupby('phase')
# correlations = corr_grouped.apply(lambda x: x['phase'].cat.codes.corr(x['bite_corr_data']))

# # print the correlation coefficients for each moon phase
# print(correlations)

phase
first quarter     NaN
waxing gibbous    NaN
full              NaN
waning gibbous    NaN
last quarter      NaN
waning crescent   NaN
new               NaN
waxing crescent   NaN
dtype: float64
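
The NaN values above are what a per-group correlation gives when one input is constant within the group. Inside a single phase every phase code is identical, so its standard deviation is zero and the coefficient is undefined. The output presumably comes from an earlier run of the code that is now commented out. A small sketch with made-up values reproduces the effect and shows one alternative, the share of days with a bite in each phase (`any_bite` is a hypothetical column name):

```python
import numpy as np
import pandas as pd

# Within one phase the code never changes
codes = pd.Series([2, 2, 2, 2])
flags = pd.Series([1, 0, 1, 1])

# Silence the 0/0 warning; the result is NaN because the codes have zero variance
with np.errstate(invalid="ignore", divide="ignore"):
    within_phase_corr = codes.corr(flags)
print(within_phase_corr)

# Alternative: compare the share of days with at least one bite across phases
daily = pd.DataFrame({
    "phase": ["full", "full", "new", "new"],
    "any_bite": [1, 0, 1, 1],
})
share_with_bite = daily.groupby("phase")["any_bite"].mean()
print(share_with_bite)
```

Comparing such shares (or bites per day) across phases answers the per-phase question without needing a correlation inside each group.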
