# Business Task
## Identify the honey bee population trend and how factors such as climate change, parasites, and disease contributed to it

# Library Importation

In [268]:
import pandas as pd 
import numpy as np 
import plotly.express as px 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score 
import statsmodels.api as sm


# Data Importation and Validation 

In [209]:
df = pd.read_csv('/kaggle/input/save-the-honey-bees/save_the_bees.csv')
print("The Dataframe's Columns are:")
print(df.columns)

The Dataframe's Columns are:
Index(['state', 'num_colonies', 'max_colonies', 'lost_colonies',
       'percent_lost', 'added_colonies', 'renovated_colonies',
       'percent_renovated', 'quarter', 'year', 'varroa_mites',
       'other_pests_and_parasites', 'diseases', 'pesticides', 'other',
       'unknown'],
      dtype='object')


In [210]:
print('The shape of the dataframe is:')
print(df.shape)

The shape of the dataframe is:
(1453, 16)


In [211]:
print('The structure and datatype of the dataframe:')
df.info()  # info() prints its report and returns None, so there is nothing to assign

The structure and datatype of the dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1453 entries, 0 to 1452
Data columns (total 16 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   state                      1453 non-null   object 
 1   num_colonies               1453 non-null   int64  
 2   max_colonies               1453 non-null   int64  
 3   lost_colonies              1453 non-null   int64  
 4   percent_lost               1453 non-null   int64  
 5   added_colonies             1453 non-null   int64  
 6   renovated_colonies         1453 non-null   int64  
 7   percent_renovated          1453 non-null   int64  
 8   quarter                    1453 non-null   int64  
 9   year                       1453 non-null   int64  
 10  varroa_mites               1453 non-null   float64
 11  other_pests_and_parasites  1453 non-null   float64
 12  diseases                   1453 non-null   float64
 13  pes

In [212]:
missing_values = df.isnull().sum()
print('Number of missing values per column:')
print(missing_values)

Number of missing values per column:
state                        0
num_colonies                 0
max_colonies                 0
lost_colonies                0
percent_lost                 0
added_colonies               0
renovated_colonies           0
percent_renovated            0
quarter                      0
year                         0
varroa_mites                 0
other_pests_and_parasites    0
diseases                     0
pesticides                   0
other                        0
unknown                      0
dtype: int64


In [213]:
print('The states in the dataset')
df.state.unique()

The states in the dataset


array(['Alabama', 'Arizona', 'Arkansas', 'California', 'Colorado',
       'Connecticut', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois',
       'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine',
       'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'New Jersey',
       'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio',
       'Oklahoma', 'Oregon', 'Pennsylvania', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont',
       'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming',
       'Other', 'United States'], dtype=object)

In [214]:
print("Removing the 'United States' rows, which are summary totals rather than individual states")
df = df[df['state'] != 'United States']
df


Removing the 'United States' rows, which are summary totals rather than individual states


Unnamed: 0,state,num_colonies,max_colonies,lost_colonies,percent_lost,added_colonies,renovated_colonies,percent_renovated,quarter,year,varroa_mites,other_pests_and_parasites,diseases,pesticides,other,unknown
0,Alabama,7000,7000,1800,26,2800,250,4,1,2015,10.0,5.4,0.0,2.2,9.1,9.4
1,Arizona,35000,35000,4600,13,3400,2100,6,1,2015,26.9,20.5,0.1,0.0,1.8,3.1
2,Arkansas,13000,14000,1500,11,1200,90,1,1,2015,17.6,11.4,1.5,3.4,1.0,1.0
3,California,1440000,1690000,255000,15,250000,124000,7,1,2015,24.7,7.2,3.0,7.5,6.5,2.8
4,Colorado,3500,12500,1500,12,200,140,1,1,2015,14.6,0.9,1.8,0.6,2.6,5.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1447,Washington,84000,89000,7500,8,540,220,0,4,2022,34.0,6.4,0.0,0.0,4.2,8.2
1448,West Virginia,7500,8000,1100,14,0,220,3,4,2022,33.4,3.8,0.8,0.0,6.4,0.5
1449,Wisconsin,26000,47000,3500,7,140,380,1,4,2022,23.2,21.4,19.4,17.5,9.9,11.7
1450,Wyoming,19500,21000,3200,15,640,0,0,4,2022,22.9,5.9,4.2,0.0,0.0,7.4
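
The shapes reported in this notebook (1453 rows on load, 1422 in the later summary) imply that 31 'United States' rows were dropped, not one. A minimal sketch of how that can be confirmed, using a small made-up frame in place of the real file (the values in `toy` are hypothetical):

```python
import pandas as pd

# Toy stand-in for the real file: two quarters, two states plus a national total row each.
toy = pd.DataFrame({
    "state": ["Alabama", "Arizona", "United States",
              "Alabama", "Arizona", "United States"],
    "year": [2015, 2015, 2015, 2015, 2015, 2015],
    "quarter": [1, 1, 1, 2, 2, 2],
    "num_colonies": [7000, 35000, 42000, 7500, 36000, 43500],
})

is_total = toy["state"] == "United States"

# How many summary rows are there, and how many distinct periods do they cover?
n_total_rows = int(is_total.sum())
n_periods = toy.loc[is_total, ["year", "quarter"]].drop_duplicates().shape[0]

# Keep only the state-level rows.
states_only = toy.loc[~is_total].reset_index(drop=True)

print(n_total_rows, n_periods, states_only.shape)
```

If the two counts agree on the real data, there is exactly one summary row per reported quarter.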


In [215]:
print('The Number of Duplicated Rows=')
print(df.duplicated().sum())

The Number of Duplicated Rows=
0
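
Another validation step that may be worth adding is a consistency check between columns. The sketch below assumes that `percent_lost` is `lost_colonies` divided by `max_colonies`, rounded to a whole percent. That definition is my assumption, not something stated in the file, and it is tested only on the five rows displayed earlier.

```python
import pandas as pd

# The five rows shown in the output above (values copied from the notebook).
sample = pd.DataFrame({
    "state": ["Alabama", "Arizona", "Arkansas", "California", "Colorado"],
    "max_colonies": [7000, 35000, 14000, 1690000, 12500],
    "lost_colonies": [1800, 4600, 1500, 255000, 1500],
    "percent_lost": [26, 13, 11, 15, 12],
})

# Assumption: percent_lost = lost_colonies / max_colonies, as a whole percentage.
implied = (100 * sample["lost_colonies"] / sample["max_colonies"]).round().astype(int)
sample["implied_percent_lost"] = implied
sample["matches"] = sample["implied_percent_lost"] == sample["percent_lost"]

print(sample[["state", "percent_lost", "implied_percent_lost", "matches"]])
```

All five rows match under this assumption. Running the same comparison on the full frame would show whether it holds everywhere.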


In [216]:
print('The first five(5) rows:')
print(df.head(5))

The first five(5) rows:
        state  num_colonies  max_colonies  lost_colonies  percent_lost  \
0     Alabama          7000          7000           1800            26   
1     Arizona         35000         35000           4600            13   
2    Arkansas         13000         14000           1500            11   
3  California       1440000       1690000         255000            15   
4    Colorado          3500         12500           1500            12   

   added_colonies  renovated_colonies  percent_renovated  quarter  year  \
0            2800                 250                  4        1  2015   
1            3400                2100                  6        1  2015   
2            1200                  90                  1        1  2015   
3          250000              124000                  7        1  2015   
4             200                 140                  1        1  2015   

   varroa_mites  other_pests_and_parasites  diseases  pesticides  other  \
0    

In [217]:
print('The last five(5) rows:')
print(df.tail(5))

The last five(5) rows:
              state  num_colonies  max_colonies  lost_colonies  percent_lost  \
1447     Washington         84000         89000           7500             8   
1448  West Virginia          7500          8000           1100            14   
1449      Wisconsin         26000         47000           3500             7   
1450        Wyoming         19500         21000           3200            15   
1451          Other         30030         30030            480             2   

      added_colonies  renovated_colonies  percent_renovated  quarter  year  \
1447             540                 220                  0        4  2022   
1448               0                 220                  3        4  2022   
1449             140                 380                  1        4  2022   
1450             640                   0                  0        4  2022   
1451            1190                 130                  0        4  2022   

      varroa_mites  other_p

# Statistical Analysis

In [218]:
summary = df.describe()
summary

Unnamed: 0,num_colonies,max_colonies,lost_colonies,percent_lost,added_colonies,renovated_colonies,percent_renovated,quarter,year,varroa_mites,other_pests_and_parasites,diseases,pesticides,other,unknown
count,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0,1422.0
mean,62997.36,79641.99,8495.949367,11.16948,8119.824191,6990.731364,7.075246,2.505626,2018.47398,29.900774,10.866807,3.346273,6.076653,6.008087,3.967511
std,147889.7,187851.9,23179.134163,7.440506,22984.40268,21946.942624,9.059695,1.132793,2.322751,18.92901,13.156143,6.519937,9.009384,6.52318,4.985715
min,1300.0,1300.0,0.0,0.0,0.0,0.0,0.0,1.0,2015.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,8000.0,9500.0,930.0,6.0,370.0,150.0,1.0,1.0,2016.0,15.4,1.8,0.1,0.3,1.7,0.8
50%,17750.0,22000.0,2150.0,10.0,1500.0,755.0,4.0,3.0,2018.0,26.8,6.6,1.0,2.55,4.0,2.3
75%,51750.0,69750.0,6500.0,14.0,5500.0,3300.0,10.0,4.0,2021.0,41.675,14.95,3.8,8.1,8.0,5.3
max,1440000.0,1710000.0,275000.0,65.0,250000.0,285000.0,77.0,4.0,2022.0,98.8,91.9,87.4,73.5,61.4,46.2


# Exploratory Analysis 



## Trend over Time 

In [220]:
grouped_year = df.groupby('year')['num_colonies'].sum().reset_index()
highest_colonies = grouped_year.sort_values(by='num_colonies',ascending=False)
highest_colonies.reset_index(drop=True, inplace=True)
print('Year with the highest colonies:')
print(highest_colonies.head(10))


Year with the highest colonies:
   year  num_colonies
0  2020      12158770
1  2021      11997940
2  2022      11780420
3  2015      11681750
4  2016      11634650
5  2017      11179510
6  2018      10283660
7  2019       8865540


In [221]:
grouped_year = df.groupby('year')['num_colonies'].sum().reset_index()
lowest_colonies = grouped_year.sort_values(by='num_colonies',ascending=True)
lowest_colonies.reset_index(drop=True, inplace=True)
print('Year with the lowest colonies:')
print(lowest_colonies.head(10))

Year with the lowest colonies:
   year  num_colonies
0  2019       8865540
1  2018      10283660
2  2017      11179510
3  2016      11634650
4  2015      11681750
5  2022      11780420
6  2021      11997940
7  2020      12158770
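
These yearly figures are sums of quarterly counts, so a year with fewer reported quarters will come out lower even if the bee population did not change. It may be worth checking how many quarters each year contributes, especially for 2019. A minimal sketch on made-up data (the values in `toy` are hypothetical):

```python
import pandas as pd

# Toy data: 2018 has four quarters, 2019 is missing one (hypothetical values).
toy = pd.DataFrame({
    "state": ["A"] * 7,
    "year": [2018] * 4 + [2019] * 3,
    "quarter": [1, 2, 3, 4, 1, 3, 4],
    "num_colonies": [100, 110, 120, 90, 105, 125, 95],
})

# First total the states within each quarter, then summarize the quarters per year.
per_quarter = toy.groupby(["year", "quarter"], as_index=False)["num_colonies"].sum()
by_year = per_quarter.groupby("year")["num_colonies"].agg(
    quarters="count", yearly_sum="sum", mean_per_quarter="mean"
)

print(by_year)
```

In this toy case 2019 has the lower yearly sum but the higher per-quarter mean, which is why the per-quarter average is the safer basis for comparing years.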


In [222]:
grouped_year = df.groupby('year')['added_colonies'].sum().reset_index()
highest_added_colonies = grouped_year.sort_values(by='added_colonies',ascending=False)
highest_added_colonies.reset_index(drop=True, inplace=True)
print('Year with the highest added colonies:')
print(highest_added_colonies.head(10))

Year with the highest added colonies:
   year  added_colonies
0  2018         1778600
1  2020         1667110
2  2016         1650780
3  2017         1580480
4  2015         1498980
5  2021         1372060
6  2022         1264860
7  2019          733520


In [223]:
grouped_year = df.groupby('year')['added_colonies'].sum().reset_index()
lowest_added_colonies = grouped_year.sort_values(by='added_colonies',ascending=True)
lowest_added_colonies.reset_index(drop=True, inplace=True)
print('Year with the lowest added colonies:')
print(lowest_added_colonies.head(10))

Year with the lowest added colonies:
   year  added_colonies
0  2019          733520
1  2022         1264860
2  2021         1372060
3  2015         1498980
4  2017         1580480
5  2016         1650780
6  2020         1667110
7  2018         1778600


In [224]:
grouped_year = df.groupby('year')['lost_colonies'].sum().reset_index()
highest_lost_colonies = grouped_year.sort_values(by='lost_colonies',ascending=False)
highest_lost_colonies.reset_index(drop=True, inplace=True)
print('Year with the highest colony loss:')
print(highest_lost_colonies.head(10))

Year with the highest colony loss:
   year  lost_colonies
0  2015        1722360
1  2016        1645560
2  2020        1612510
3  2018        1520460
4  2017        1503910
5  2021        1441690
6  2022        1392840
7  2019        1241910


In [225]:
grouped_year = df.groupby('year')['lost_colonies'].sum().reset_index()
lowest_lost_colonies = grouped_year.sort_values(by='lost_colonies',ascending=True)
lowest_lost_colonies.reset_index(drop=True, inplace=True)
print('Year with the lowest colony loss:')
print(lowest_lost_colonies.head(10))

Year with the lowest colony loss:
   year  lost_colonies
0  2019        1241910
1  2022        1392840
2  2021        1441690
3  2017        1503910
4  2018        1520460
5  2020        1612510
6  2016        1645560
7  2015        1722360
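
The six cells above repeat the same group, sort, and print pattern. One possible way to condense them is a single aggregation followed by `idxmax` and `idxmin`. This is a sketch on made-up data, not a replacement for the outputs above.

```python
import pandas as pd

# Hypothetical values, only to show the pattern.
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2017, 2017],
    "num_colonies": [100, 200, 150, 250, 90, 120],
    "added_colonies": [10, 20, 30, 5, 8, 12],
    "lost_colonies": [15, 25, 10, 20, 5, 9],
})

metrics = ["num_colonies", "added_colonies", "lost_colonies"]

# One table with the yearly total of every metric.
yearly = toy.groupby("year")[metrics].sum()

# For each metric, the year in which it was highest and lowest.
extremes = pd.DataFrame({
    "highest_year": yearly.idxmax(),
    "lowest_year": yearly.idxmin(),
})

print(yearly)
print(extremes)
```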


In [226]:
### Visualization of the total number of colonies over time 
fig = px.histogram(df, x='year', y='num_colonies',
            title='The Total Number of Colonies Per Year (2015-2022)',
            width=500,
            height=400,
            labels={'year': 'Year'})

### Visualization of the added colonies over the years 
fig2 = px.histogram(df, x='year', y='added_colonies',
            title='Added Colonies Per Year (2015-2022)',
            width=500,
            height=400,
            labels={'year': 'Year'})

### Visualization of the lost colonies over the years 
fig3 = px.histogram(df, x='year', y='lost_colonies',
            title='Lost Colonies Per Year (2015-2022)',
            width=500,
            height=400,
            labels={'year': 'Year'})

### Visualization of the percentage lost over time 
fig4 = px.box(df, x='year', y='percent_lost',
            title='The Percentage Lost Per Year (2015-2022)',
            width=500,
            height=400,
            labels={'year': 'Year'})

fig.show()
fig2.show()
fig3.show()
fig4.show()

In [227]:
print('The year with the highest number of colonies was 2020 with 12.16m colonies, while the year with the lowest was 2019 with 8.87m colonies. Colony losses were highest in 2015 (1.72m) and lowest in 2019 (1.24m). All figures are sums of the quarterly counts.')

The year with the highest number of colonies was 2020 with 12.16m colonies, while the year with the lowest was 2019 with 8.87m colonies. Colony losses were highest in 2015 (1.72m) and lowest in 2019 (1.24m). All figures are sums of the quarterly counts.
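
The yearly tables printed earlier can be turned into this kind of summary sentence directly, which keeps the quoted figures tied to the computed ones. The sketch below copies the yearly sums from those tables. `describe_extremes` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Yearly sums copied from the tables printed above.
num_colonies = pd.Series(
    {2015: 11681750, 2016: 11634650, 2017: 11179510, 2018: 10283660,
     2019: 8865540, 2020: 12158770, 2021: 11997940, 2022: 11780420}
)
lost_colonies = pd.Series(
    {2015: 1722360, 2016: 1645560, 2017: 1503910, 2018: 1520460,
     2019: 1241910, 2020: 1612510, 2021: 1441690, 2022: 1392840}
)


def describe_extremes(series, label):
    """Return a one-line summary of the highest and lowest year, in millions."""
    hi, lo = series.idxmax(), series.idxmin()
    return (f"{label}: highest in {hi} ({series[hi] / 1e6:.2f}m), "
            f"lowest in {lo} ({series[lo] / 1e6:.2f}m)")


colonies_summary = describe_extremes(num_colonies, "num_colonies")
losses_summary = describe_extremes(lost_colonies, "lost_colonies")
print(colonies_summary)
print(losses_summary)
```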


## Trends by States 

In [228]:
grouped_states = df.groupby('state')['num_colonies'].sum().reset_index()
highest_colonies = grouped_states.sort_values(by='num_colonies',ascending=False)
highest_colonies.reset_index(drop=True, inplace=True)
print('States with the highest colonies:')
print(highest_colonies.head(10))



States with the highest colonies:
          state  num_colonies
0    California      27910000
1  North Dakota       7833000
2       Florida       7682000
3         Texas       6460000
4       Georgia       3792000
5  South Dakota       3329000
6         Idaho       3143000
7        Oregon       2669000
8     Minnesota       2451000
9       Montana       2319500


In [229]:
grouped_states = df.groupby('state')['added_colonies'].sum().reset_index()
highest_added_colonies = grouped_states.sort_values(by='added_colonies',ascending=False)
highest_added_colonies.reset_index(drop=True, inplace=True)
print('States with the highest added colonies:')
print(highest_added_colonies.head(10))

States with the highest added colonies:
          state  added_colonies
0    California         3682000
1       Florida         1369000
2         Texas         1286820
3       Georgia          664200
4  North Dakota          449800
5         Idaho          357550
6        Oregon          265200
7     Minnesota          248460
8    Washington          234960
9   Mississippi          229610


In [230]:
grouped_states = df.groupby('state')['added_colonies'].sum().reset_index()
lowest_added_colonies = grouped_states.sort_values(by='added_colonies',ascending=True)
lowest_added_colonies.reset_index(drop=True, inplace=True)
print('States with the lowest added colonies:')
print(lowest_added_colonies.head(10))

States with the lowest added colonies:
           state  added_colonies
0    Connecticut           10490
1          Maine           12550
2        Vermont           16020
3  Massachusetts           18600
4  West Virginia           23810
5     New Jersey           25870
6          Other           30420
7         Hawaii           32450
8       Kentucky           32470
9       Virginia           35380


In [231]:
grouped_states = df.groupby('state')['lost_colonies'].sum().reset_index()
highest_lost_colonies = grouped_states.sort_values(by='lost_colonies',ascending=False)
highest_lost_colonies.reset_index(drop=True, inplace=True)
print('States with the highest colony loss:')
print(highest_lost_colonies.head(10))

States with the highest colony loss:
          state  lost_colonies
0    California        4098000
1       Florida        1161000
2  North Dakota         949860
3         Texas         822000
4       Georgia         566500
5  South Dakota         488740
6         Idaho         409700
7     Minnesota         305020
8      Michigan         272200
9    Washington         258900


In [232]:
grouped_states = df.groupby('state')['lost_colonies'].sum().reset_index()
lowest_lost_colonies = grouped_states.sort_values(by='lost_colonies',ascending=True)
lowest_lost_colonies.reset_index(drop=True, inplace=True)
print('States with the lowest colony loss:')
print(lowest_lost_colonies.head(10))

States with the lowest colony loss:
           state  lost_colonies
0    Connecticut           7630
1        Vermont          10020
2         Hawaii          20740
3     New Jersey          21490
4  Massachusetts          21550
5  West Virginia          27130
6       Missouri          30160
7       Maryland          31020
8       Virginia          31780
9          Other          33380
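
Raw totals mostly track how many colonies a state has, which is why California tops every list. A loss rate is easier to compare across states of different sizes. The sketch below divides total losses by the total of `max_colonies`. That denominator is my assumption, and the two states in `toy` are made up.

```python
import pandas as pd

# Hypothetical data: one large state and one small state.
toy = pd.DataFrame({
    "state": ["Big", "Big", "Small", "Small"],
    "max_colonies": [1000000, 1200000, 10000, 12000],
    "lost_colonies": [100000, 120000, 3000, 3600],
})

totals = toy.groupby("state")[["lost_colonies", "max_colonies"]].sum()

# Losses as a percentage of the colonies that were exposed to loss.
totals["loss_rate_pct"] = 100 * totals["lost_colonies"] / totals["max_colonies"]
ranked = totals.sort_values("loss_rate_pct", ascending=False)

print(ranked)
```

Here the large state loses the most colonies in absolute terms, while the small state has the higher loss rate.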


In [235]:
### Visualization of the total number of colonies by states 
fig4 = px.histogram(df, x='state', y='num_colonies',
            title='The Total Number of Colonies By State (2015-2022)',
            width=400,
            height=400,
            labels={'state': 'State'})

### Visualization of added colonies by states 
fig5 = px.histogram(df, x='state', y='added_colonies',
            title='Added Colonies By State (2015-2022)',
            width=400,
            height=400,
            labels={'state': 'State'})

### Visualization of lost colonies by states 
fig6 = px.histogram(df, x='state', y='lost_colonies',
            title='Lost Colonies By State (2015-2022)',
            width=400,
            height=400,
            labels={'state': 'State'})

### Visualization of the percentage lost of colonies by states 
fig7 = px.box(df, x='state', y='percent_lost',
            title='The Percentage Loss of Colonies By State (2015-2022)',
            width=800,
            height=400,
            labels={'state': 'State'})


fig4.show()
fig5.show()
fig6.show()
fig7.show()

# Predictive Analysis 

In [238]:
numeric_df = df.select_dtypes(include='number')  # Select only numeric columns
correlation_matrix = numeric_df.corr()

correlation_matrix

Unnamed: 0,num_colonies,max_colonies,lost_colonies,percent_lost,added_colonies,renovated_colonies,percent_renovated,quarter,year,varroa_mites,other_pests_and_parasites,diseases,pesticides,other,unknown
num_colonies,1.0,0.962706,0.925137,-0.00503,0.851448,0.804091,0.091797,0.016175,0.0089,0.117941,0.024016,0.1102,0.143146,0.103379,0.032943
max_colonies,0.962706,1.0,0.944162,-0.030042,0.834487,0.763404,0.080503,0.007756,0.004301,0.102503,0.016238,0.103955,0.125311,0.106823,0.019257
lost_colonies,0.925137,0.944162,1.0,0.093491,0.823493,0.691184,0.07787,0.004445,-0.017319,0.123947,0.032601,0.101681,0.14443,0.125362,0.062486
percent_lost,-0.00503,-0.030042,0.093491,1.0,0.011183,-0.011982,0.009081,-0.077887,-0.081774,0.233148,0.082251,0.164727,0.161779,0.317742,0.280056
added_colonies,0.851448,0.834487,0.823493,0.011183,1.0,0.803468,0.196311,-0.129555,-0.021201,0.092946,0.035245,0.086647,0.119335,0.109622,0.060191
renovated_colonies,0.804091,0.763404,0.691184,-0.011982,0.803468,1.0,0.35399,-0.057147,-0.034689,0.128046,0.056648,0.092881,0.1321,0.141384,0.033523
percent_renovated,0.091797,0.080503,0.07787,0.009081,0.196311,0.35399,1.0,-0.074544,-0.013702,0.228645,0.196003,0.083349,0.104301,0.145312,0.014352
quarter,0.016175,0.007756,0.004445,-0.077887,-0.129555,-0.057147,-0.074544,1.0,0.005137,0.20838,0.129627,0.075085,0.148016,-0.031,0.02124
year,0.0089,0.004301,-0.017319,-0.081774,-0.021201,-0.034689,-0.013702,0.005137,1.0,-0.019058,-0.035997,-0.081547,-0.095729,-0.051998,-0.052255
varroa_mites,0.117941,0.102503,0.123947,0.233148,0.092946,0.128046,0.228645,0.20838,-0.019058,1.0,0.566468,0.324013,0.433359,0.31265,0.169548
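
In the matrix above, `lost_colonies` correlates at about 0.93 with `num_colonies`, so it mainly reflects size. `percent_lost` is scale-free, which makes its row more informative about the stressors. The sketch below ranks the stressor correlations copied from that row. On the real frame the same Series could be taken from `correlation_matrix['percent_lost']`.

```python
import pandas as pd

# Correlations with percent_lost, copied from the matrix above.
corr_with_percent_lost = pd.Series({
    "varroa_mites": 0.233148,
    "other_pests_and_parasites": 0.082251,
    "diseases": 0.164727,
    "pesticides": 0.161779,
    "other": 0.317742,
    "unknown": 0.280056,
})

# Strongest linear association first.
ranked = corr_with_percent_lost.sort_values(ascending=False)
print(ranked.round(3))
```

All of these are weak to moderate correlations, and a correlation describes a linear association only, not a cause.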


In [257]:
model1 = RandomForestClassifier(n_estimators=100,random_state=20)
model1

### Independent Variables

In [250]:
X = df[['state', 'num_colonies', 'max_colonies',
       'percent_lost', 'added_colonies', 'renovated_colonies',
       'percent_renovated', 'quarter', 'year', 'varroa_mites',
       'other_pests_and_parasites', 'diseases', 'pesticides', 'other',
       'unknown']]
x = pd.get_dummies(X).astype(int)
x

Unnamed: 0,num_colonies,max_colonies,percent_lost,added_colonies,renovated_colonies,percent_renovated,quarter,year,varroa_mites,other_pests_and_parasites,...,state_South Dakota,state_Tennessee,state_Texas,state_Utah,state_Vermont,state_Virginia,state_Washington,state_West Virginia,state_Wisconsin,state_Wyoming
0,7000,7000,26,2800,250,4,1,2015,10,5,...,0,0,0,0,0,0,0,0,0,0
1,35000,35000,13,3400,2100,6,1,2015,26,20,...,0,0,0,0,0,0,0,0,0,0
2,13000,14000,11,1200,90,1,1,2015,17,11,...,0,0,0,0,0,0,0,0,0,0
3,1440000,1690000,15,250000,124000,7,1,2015,24,7,...,0,0,0,0,0,0,0,0,0,0
4,3500,12500,12,200,140,1,1,2015,14,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1447,84000,89000,8,540,220,0,4,2022,34,6,...,0,0,0,0,0,0,1,0,0,0
1448,7500,8000,14,0,220,3,4,2022,33,3,...,0,0,0,0,0,0,0,1,0,0
1449,26000,47000,7,140,380,1,4,2022,23,21,...,0,0,0,0,0,0,0,0,1,0
1450,19500,21000,15,640,0,0,4,2022,22,5,...,0,0,0,0,0,0,0,0,0,1
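
Note that `.astype(int)` is applied to the whole frame, so the float stressor columns are truncated as well (26.9 becomes 26 in the output above). If that is not intended, one option is to pass `dtype` to `pd.get_dummies` so that only the dummy columns are integers. A minimal sketch on a two-row toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["Alabama", "Arizona"],
    "varroa_mites": [10.0, 26.9],
    "quarter": [1, 1],
})

# Casting the whole frame truncates every float column.
truncated = pd.get_dummies(toy).astype(int)

# Restricting the integer dtype to the dummy columns keeps the floats intact.
preserved = pd.get_dummies(toy, columns=["state"], dtype=int)

print(truncated["varroa_mites"].tolist())
print(preserved["varroa_mites"].tolist())
print(preserved.dtypes)
```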


### Target

In [246]:
y = df['lost_colonies']
y

0         1800
1         4600
2         1500
3       255000
4         1500
         ...  
1447      7500
1448      1100
1449      3500
1450      3200
1451       480
Name: lost_colonies, Length: 1422, dtype: int64

## Model Training

In [252]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25, random_state=20)

In [258]:
model1.fit(x_train,y_train)

In [260]:
predict = model1.predict(x)

array([1800, 4600, 1500, ..., 3500, 3200,  630])

In [272]:
## Adding the predicted values to the column on the dataframe
df['predict'] = predict

In [274]:
print('The predicted values for the lost colonies compared to the actual values')
checking = df[['predict','lost_colonies']]
checking

The predicted values for the lost colonies compared to the actual values


Unnamed: 0,predict,lost_colonies
0,1800,1800
1,4600,4600
2,1500,1500
3,255000,255000
4,1500,1500
...,...,...
1447,7500,7500
1448,1100,1100
1449,3500,3500
1450,3200,3200


In [276]:
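coefficient_determination = r2_score(y,predict)
print('The Coefficient of Determination is')
print(coefficient_determination)
print('R^2 is about 0.973, but it is computed on all rows, including the training rows, so it overstates out-of-sample performance')


The Coefficient of Determination is
0.9727100853309566
R^2 is about 0.973, but it is computed on all rows, including the training rows, so it overstates out-of-sample performance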
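
Two things could be handled differently here. `lost_colonies` is a continuous count, so a regressor is a more natural fit than a classifier, and scoring only the held-out rows gives a fairer picture than scoring every row. The sketch below swaps in `RandomForestRegressor` and uses synthetic data whose columns and coefficients are made up, so its numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): losses are about 12% of colonies plus noise.
rng = np.random.default_rng(20)
n = 400
features = pd.DataFrame({
    "num_colonies": rng.integers(1_000, 100_000, size=n),
    "varroa_mites": rng.uniform(0, 60, size=n),
})
target = 0.12 * features["num_colonies"] + rng.normal(0, 500, size=n)

x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=20
)

# A regressor treats the target as a continuous quantity rather than as class labels.
model = RandomForestRegressor(n_estimators=100, random_state=20)
model.fit(x_train, y_train)

# Score the training rows and the held-out rows separately.
train_r2 = r2_score(y_train, model.predict(x_train))
test_r2 = r2_score(y_test, model.predict(x_test))

print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The test-set R^2 is the figure to report. Note also that R^2 is the share of variance explained, not a classification accuracy.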


In [270]:
model2 = sm.OLS(y,sm.add_constant(x))
result = model2.fit()
print('The result for the Ordinary Least Squares')
print(result.summary())

The result for the Ordinary Least Squares
                            OLS Regression Results                            
Dep. Variable:          lost_colonies   R-squared:                       0.936
Model:                            OLS   Adj. R-squared:                  0.933
Method:                 Least Squares   F-statistic:                     335.4
Date:                Mon, 04 Dec 2023   Prob (F-statistic):               0.00
Time:                        19:26:04   Log-Likelihood:                -14360.
No. Observations:                1422   AIC:                         2.884e+04
Df Residuals:                    1362   BIC:                         2.916e+04
Df Model:                          59                                         
Covariance Type:            nonrobust                                         
                                coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------

# Insights

From the OLS regression results, I examined the coefficients and p-values for the variables related to parasites and diseases. The model has no direct variable for climate change. Each coefficient below is read with the other predictors held fixed:
 

For Varroa mites, a type of parasite, the coefficient is -13.1089 and the p-value is 0.281. Taken at face value, each one-unit increase in varroa mites is associated with 13.1089 fewer lost colonies. However, the p-value is greater than 0.05, so the estimate is not statistically significant at the 5% level. I therefore cannot conclude from this model that Varroa mites have an effect on lost colonies.



For other pests and parasites, the coefficient is 7.6791 and the p-value is 0.680. Taken at face value, each one-unit increase in other pests and parasites is associated with 7.6791 more lost colonies. As with Varroa mites, the p-value is above the 0.05 threshold, so I cannot conclude that these pests and parasites contribute to lost colonies.



Diseases, on the other hand, show a statistically significant association. The coefficient is -66.5464 and the p-value is 0.023, which is below 0.05. Each one-unit increase in diseases is therefore associated with 66.5464 fewer lost colonies. Note the negative sign: the model links higher disease values to fewer lost colonies, not more, so the result shows a significant association rather than evidence that disease drives losses.
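
The three results above can be put side by side to make the decision rule explicit. The sketch below uses only the coefficients and p-values quoted in these paragraphs. On the fitted statsmodels result the same quantities are available as `result.params` and `result.pvalues`.

```python
import pandas as pd

# Coefficients and p-values quoted in the paragraphs above.
ols_terms = pd.DataFrame(
    {
        "coef": [-13.1089, 7.6791, -66.5464],
        "p_value": [0.281, 0.680, 0.023],
    },
    index=["varroa_mites", "other_pests_and_parasites", "diseases"],
)

alpha = 0.05  # conventional 5% significance level

# A term is flagged as significant when its p-value falls below alpha.
ols_terms["significant"] = ols_terms["p_value"] < alpha
ols_terms["direction"] = ols_terms["coef"].apply(
    lambda c: "negative" if c < 0 else "positive"
)

print(ols_terms)
```

Only `diseases` passes the 5% threshold, and its direction is negative.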


# Limitation


Climate change is not directly included in my model. It could still act indirectly through variables that are included, for instance by influencing the prevalence of pests, parasites, and diseases. Measuring its impact directly would require adding variables that capture aspects of climate change, such as temperature or precipitation changes.
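
As a rough illustration of what adding climate variables could look like, the sketch below joins a climate table to the bee data on state, year, and quarter. The climate table, its column names (`avg_temp_c`, `precip_mm`), and all values in both frames are hypothetical. No such table is part of this dataset.

```python
import pandas as pd

# Hypothetical excerpt of the bee data.
bees = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Arizona"],
    "year": [2015, 2015, 2015],
    "quarter": [1, 2, 1],
    "percent_lost": [26, 18, 13],
})

# Hypothetical climate table; column names and values are made up.
climate = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Arizona"],
    "year": [2015, 2015, 2015],
    "quarter": [1, 2, 1],
    "avg_temp_c": [10.5, 22.1, 14.3],
    "precip_mm": [350.0, 310.0, 60.0],
})

# A left join keeps every bee row; validate guards against duplicated keys.
merged = bees.merge(
    climate, on=["state", "year", "quarter"], how="left", validate="one_to_one"
)

print(merged)
```

The climate columns could then be added to the predictor set alongside the existing stressor variables.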

# Conclusion

In conclusion, this analysis offers some insight into the factors associated with the loss of bee colonies. In the Ordinary Least Squares (OLS) regression model, diseases were the only stressor examined with a statistically significant coefficient, and that coefficient was negative. The coefficients for Varroa mites and other pests and parasites were not statistically significant.



In terms of the number of colonies, summed over quarters, 2020 had the highest total and 2019 the lowest. Added colonies were highest in 2018 and lowest in 2019. Colony losses were highest in 2015 and lowest in 2019.



At the state level, California had the most colonies, the most added colonies, and the highest colony loss. Connecticut had both the fewest added colonies and the lowest colony loss.



Because the model has no climate variable, any effect of climate change could only show up indirectly, through the prevalence of pests, parasites, and diseases.



Overall, the analysis points to a complex mix of factors behind the loss of bee colonies. Diseases stand out as the one stressor with a statistically significant association in this model, and climate change may affect colony health indirectly. Future research should include direct measures of climate change to clarify its impact on bee colonies.