# Baseline model

The baseline model sets a minimum level of performance that any trained model should exceed. One possible baseline always predicts the majority class (no disaster); another predicts disaster or no disaster at random.

## Load the data

In [1]:
import pandas as pd

In [2]:
train = pd.read_csv('data/train.csv')
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
test = pd.read_csv('data/test.csv')
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


## Defining the model

In [4]:
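from sklearn.base import BaseEstimator, ClassifierMixin
import numpy

# Seed NumPy's global RNG so the random baseline is reproducible
numpy.random.seed(42)

class BaselineModelMajority(BaseEstimator, ClassifierMixin):
    """Always predicts the majority class 0 (no disaster)."""
    def fit(self, X, y):
        # Nothing to learn; return self, as scikit-learn estimators are expected to
        return self

    def predict(self, X, y=None):
        return [0]*len(X)

class BaselineModelRandom(BaseEstimator, ClassifierMixin):
    """Predicts 0 or 1 uniformly at random, ignoring the input."""
    def fit(self, X, y):
        # Nothing to learn; return self, as scikit-learn estimators are expected to
        return self

    def predict(self, X, y=None):
        return numpy.random.randint(2, size=len(X))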

In [5]:
X = train.drop(columns=['target'])
y = train['target']

In [6]:
blm = BaselineModelMajority()
blr = BaselineModelRandom()

## Evaluating the baseline models

In [12]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(blm, X, y, cv=5, scoring='f1')
scores

array([0., 0., 0., 0., 0.])

In [13]:
print("F1-Score Baseline Majority class ", scores.mean())

F1-Score Baseline Majority class  0.0


In [14]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(blr, X, y, cv=5, scoring='f1')
scores

array([0.44397163, 0.44892286, 0.45531915, 0.47272727, 0.45743146])

In [15]:
print("F1-Score Random class ", scores.mean())

F1-Score Random class  0.45567447467998967


## Derivation of the F1-Score for the baseline models
The F1-Score is the harmonic mean of precision and recall:
$F_1 = \frac{2}{precision^{-1}+recall^{-1}} = 2 \cdot \frac{precision \cdot recall}{precision+recall}$
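The formula can be written as a small helper to make the arithmetic concrete. This is a minimal sketch: the function name `f1_from_precision_recall` is made up for illustration, and returning 0.0 when both inputs are zero is an assumed convention to avoid dividing by zero.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0  # assumed convention to avoid division by zero
    return 2 * precision * recall / (precision + recall)

# Equal precision and recall give that same value back
print(f1_from_precision_recall(0.5, 0.5))
# An imbalance pulls the score toward the smaller of the two values
print(f1_from_precision_recall(0.9, 0.1))
```

Because it is a harmonic mean, the score stays low unless precision and recall are both reasonably high.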

For the task of finding disaster tweets, the majority model never predicts the positive class (disaster), so it has no true positives:
* Precision: undefined (0/0), treated as 0
* Recall: 0
* F1-Score: 0
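Counting the outcomes directly shows where these values come from. The sketch below uses made-up toy labels rather than the real training data, and the helper `precision_recall_f1` is a hypothetical name; it sets any undefined ratio to 0.0, which is an assumption chosen to match the scores reported above.

```python
def precision_recall_f1(y_true, y_pred):
    """Scores for the positive class (1); undefined ratios are set to 0.0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up for illustration): 3 disaster tweets out of 7
y_true = [1, 0, 0, 1, 0, 1, 0]
y_pred = [0] * len(y_true)  # majority baseline: always "no disaster"

majority_scores = precision_recall_f1(y_true, y_pred)
print(majority_scores)
```

With no positive predictions there are no true positives, so every disaster tweet counts as a false negative.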

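The random model predicts each class with probability 0.5. With 42.966% disaster tweets in the training data, its expected scores are:
* Precision: 0.42966 (the share of disaster tweets)
* Recall: 0.5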
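
These two values can be motivated with a rough sketch. Assume the predictions are drawn independently of the true labels, with probability $q = 0.5$ for the disaster class, and write $\pi = 0.42966$ for the share of disaster tweets among $n$ tweets (the symbols $q$, $\pi$ and $n$ are introduced here for illustration only). The expected counts of true positives, false positives and false negatives are then approximately:
$TP \approx n \pi q, \quad FP \approx n (1-\pi) q, \quad FN \approx n \pi (1-q)$

Substituting them into the definitions gives:
$precision = \frac{TP}{TP+FP} \approx \frac{n \pi q}{n q} = \pi = 0.42966, \qquad recall = \frac{TP}{TP+FN} \approx \frac{n \pi q}{n \pi} = q = 0.5$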

In [19]:
print("F1-Score random model:",(2*0.5*0.42966)/(0.5+0.42966))

F1-Score random model: 0.4621689649979563
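
As a sanity check on this calculation, the random baseline can be simulated on synthetic labels. This is a sketch under stated assumptions: the labels are generated with a 42.966% positive rate rather than taken from the real tweets, and the function name, sample size and seed are arbitrary choices for illustration.

```python
import random

def simulate_random_baseline_f1(n=100_000, positive_rate=0.42966, seed=0):
    """F1 of coin-flip predictions on synthetic labels (not the real tweets)."""
    rng = random.Random(seed)
    tp = fp = fn = 0
    for _ in range(n):
        truth = rng.random() < positive_rate  # synthetic label
        guess = rng.random() < 0.5            # random prediction
        if guess and truth:
            tp += 1
        elif guess and not truth:
            fp += 1
        elif truth:  # a disaster tweet that was predicted as "no disaster"
            fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

simulated_f1 = simulate_random_baseline_f1()
print("Simulated F1-Score random model:", simulated_f1)
```

With a large sample the simulated score should land close to the analytical value; smaller samples, such as individual cross-validation folds, can be expected to scatter around it.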


## Predict with the baseline model

In [29]:
test = pd.read_csv('data/test.csv')
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [30]:
pred = blr.predict(test)

In [31]:
numpy.mean(pred)

0.5102666258044745

In [33]:
submission = pd.DataFrame({"id":test['id'], "target":pred})
submission.head()

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,0


In [35]:
submission.to_csv('submission_baseline.csv', index=False)