## Active Learning 

Active learning is a special case of machine learning in which the learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs. It is most useful when labelled data is scarce: rather than labelling examples at random, you spend the labelling effort on the points the model is least sure about.
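
One common way to choose which points to query is uncertainty sampling. The sketch below is a minimal illustration, not necessarily the strategy used later in this notebook: the function name `select_most_uncertain` and the sample probabilities are hypothetical, and it assumes a binary classifier whose positive-class probabilities are already available.

```python
import numpy as np

def select_most_uncertain(probabilities, n_queries):
    """Return the indices of the n_queries samples whose predicted
    positive-class probability is closest to 0.5 (hypothetical helper)."""
    probabilities = np.asarray(probabilities, dtype=float)
    # Distance from 0.5: the smaller it is, the less certain the model.
    uncertainty = np.abs(probabilities - 0.5)
    return np.argsort(uncertainty, kind="stable")[:n_queries]

# Hypothetical predicted probabilities for five unlabelled rows.
# With scikit-learn these could come from model.predict_proba(X)[:, 1].
p_unlabelled = [0.95, 0.52, 0.10, 0.45, 0.70]

to_label = select_most_uncertain(p_unlabelled, n_queries=2)
print(to_label)  # rows 1 and 3 are the closest to 0.5
```

The selected rows would then be labelled by hand, added to the training set, and the model refit before the next round of queries.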

In [2]:
import pandas as pd
import numpy as np
import re
import time
import bs4
import json
import glob
import tqdm
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

pd.set_option('display.max_columns', 131)

%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [None]:
df = pd.read_csv('raw_data_with_labels.csv', index_col=0)
df = df[df['y'].notnull()]
df.shape

In [None]:
df.head()

In [None]:
cleaned_df = pd.DataFrame(index=df.index)

### 1. Cleaning DateTime

In [None]:
# Extract day, month and year; str.extract returns a DataFrame with one column per capture group
cleaned_date = df['watch-time-text'].str.extract(r"(\d+) de ([a-z]+)\. de (\d+)")
# Zero-pad single-digit days ("7" -> "07"); str.zfill leaves rows without a match as NaN
cleaned_date[0] = cleaned_date[0].str.zfill(2)
#cleaned_date[1] = cleaned_date[1].map(lambda x: x[0].upper()+x[1:])

# Map Portuguese month abbreviations to the English ones expected by "%b"
month_matcher = {
    "jan": "Jan",
    "fev": "Feb",
    "mar": "Mar",
    "abr": "Apr",
    "mai": "May",
    "jun": "Jun",
    "jul": "Jul",
    "ago": "Aug",
    "set": "Sep",
    "out": "Oct",
    "nov": "Nov",
    "dez": "Dec",
}

cleaned_date[1] = cleaned_date[1].map(month_matcher)
# Join day, month and year row by row; axis=1 belongs to apply, not to str.join
cleaned_date = cleaned_date.apply(lambda x: " ".join(x), axis=1)
cleaned_date.head()

In [None]:
cleaned_df['date'] = pd.to_datetime(cleaned_date, format="%d %b %Y")
cleaned_df.head()
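
The date-cleaning steps above can be checked in isolation on a couple of hand-written strings. This is a minimal sketch: the sample values in `raw` are hypothetical and only assumed to resemble the `watch-time-text` column, and `month_matcher` is cut down to the two months they use.

```python
import pandas as pd

# Hypothetical sample strings, assumed to mimic the 'watch-time-text' column.
raw = pd.Series([
    "Publicado em 7 de jan. de 2019",
    "Publicado em 23 de dez. de 2018",
])

# Subset of the full Portuguese-to-English month mapping.
month_matcher = {"jan": "Jan", "dez": "Dec"}

# One column per capture group: day, month abbreviation, year.
parts = raw.str.extract(r"(\d+) de ([a-z]+)\. de (\d+)")
parts[0] = parts[0].str.zfill(2)         # "7" -> "07"
parts[1] = parts[1].map(month_matcher)   # "jan" -> "Jan"

# Join the three columns row by row, then parse with an explicit format.
joined = parts.apply(lambda row: " ".join(row), axis=1)
parsed = pd.to_datetime(joined, format="%d %b %Y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Note that `%b` matches month abbreviations in the current locale, so this sketch assumes an English (or C) locale.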