# Co-training

**Authors:** Peter Macinec, Lukas Janik, Vajk Pomichal, Frantisek Sefcik

## Basic setup and library imports

In [1]:
import pandas as pd
import numpy as np


# plots
import matplotlib.pyplot as plt
import seaborn as sns

import json

import re

import nltk
from nltk.stem.porter import PorterStemmer
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection as ms

from sklearn.metrics import confusion_matrix

pd.options.mode.chained_assignment = None  # default='warn'

### Loading the dataset

Our data comes in two files, *train.tsv* and *test.tsv*. We load both and run a basic analysis on them. Source: https://www.kaggle.com/c/stumbleupon

In [2]:
# training data
df = pd.read_csv('data/train.tsv', sep='\t')

In [3]:
# test data
df_t = pd.read_csv('data/test.tsv', sep='\t')

## Text attributes

First we preprocess the text, which we extract from the `boilerplate` attribute:

In [4]:
df['body_content'] = df['boilerplate'].apply(lambda x: json.loads(x)['body'])
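
Each `boilerplate` value is a JSON string, and the cell above reads its `body` field. A minimal sketch of that step on made-up records follows. The `extract_body` helper and the sample strings are assumptions for illustration only, and the fallback to an empty string for a missing or null `body` is a defensive choice that the original cell does not make.

```python
import json

def extract_body(boilerplate: str) -> str:
    """Return the 'body' field of a boilerplate JSON string, or '' if it is absent or null."""
    record = json.loads(boilerplate)
    body = record.get("body")
    return body if isinstance(body, str) else ""

# Hypothetical records, not taken from the dataset.
sample = '{"title": "Easy pasta", "body": "Boil water. Add pasta.", "url": "example com pasta"}'
print(extract_body(sample))
print(repr(extract_body('{"title": "No body", "body": null}')))
```

Without such a fallback, a null `body` becomes `None`, which the later `str(x)` calls would turn into the literal word "None".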

Next we strip every character that is not an ASCII letter, using regular expressions:

In [5]:
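# assign the result back instead of using inplace=True on a column selection, which may not update df
df['body_content'] = df['body_content'].replace({r'[^\x00-\x7F]+': ''}, regex=True)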

In [6]:
df['body_content'] = df['body_content'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', str(x)))
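
To make the effect of the two regular expressions concrete, here is a small sketch that applies the same patterns to a single made-up string. The `clean_text` helper and the sample sentence are assumptions for illustration.

```python
import re

def clean_text(text: str) -> str:
    # Drop runs of non-ASCII characters, mirroring the first replace step.
    ascii_only = re.sub(r"[^\x00-\x7F]+", "", text)
    # Replace every remaining non-letter (digits, punctuation, whitespace) with a space.
    return re.sub(r"[^a-zA-Z]", " ", ascii_only)

raw = "Café menu: 3 soups, 2 salads!"
cleaned = clean_text(raw)
print(cleaned.split())
```

Note that non-ASCII letters are deleted rather than transliterated, so "Café" is shortened to "Caf"; digits and punctuation turn into spaces and disappear once the text is split into words.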

As a final small adjustment, so that our vocabulary contains each word only once regardless of case, we lowercase everything and split the texts into words for further processing:

In [7]:
df['body_content'] = df['body_content'].apply(lambda x: str(x).lower().split())

We still need to remove words that carry no meaning (stopwords). In the same pass, we reduce each word to its root form using stemming:

In [8]:
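porter_stemmer = PorterStemmer()
# requires the NLTK stopwords corpus: nltk.download('stopwords')
# note: this rebinds `stopwords` from the corpus reader to a set, so re-running the cell without re-importing fails
stopwords = set(stopwords.words('english'))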

In [9]:
df['body_content'] = df['body_content'].apply(lambda x: [porter_stemmer.stem(word) for word in x if word not in stopwords])
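
The combined stopword filter and stemming step can be sketched on a single sentence as follows. The `toy_stopwords` set is a hand-written assumption used only to keep the example free of corpus downloads; the notebook itself uses the full NLTK English list. The `stem_tokens` helper is likewise hypothetical.

```python
from nltk.stem.porter import PorterStemmer

# Tiny hand-written stopword set (an assumption, not the NLTK list).
toy_stopwords = {"the", "is", "and", "for", "a"}
stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Drop stopwords, then reduce each remaining token to its Porter stem."""
    return [stemmer.stem(tok) for tok in tokens if tok not in toy_stopwords]

tokens = "the cooking is quick and the baking takes longer".split()
stemmed = stem_tokens(tokens)
print(stemmed)
```

Stems are not guaranteed to be dictionary words; they only need to map related word forms to the same token so that the vocabulary shrinks.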

All words are now prepared; we just join them back into a single string so that the text-processing algorithms can work with them more easily:

In [10]:
df['body_content_final'] = df['body_content'].apply(lambda x: ' '.join(x))

### TF-IDF

In [11]:
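# keep only the 1000 most frequent terms; .toarray() below converts the sparse result to a dense array
tv = TfidfVectorizer(max_features=1000)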
tf_idf = tv.fit_transform(df['body_content_final']).toarray()
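
As a rough illustration of what the vectorizer produces, the sketch below fits `TfidfVectorizer` on a three-document toy corpus with a small `max_features`. The documents and the limit of 4 are assumptions chosen so the output is easy to inspect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical), already cleaned and stemmed.
docs = [
    "cook pasta water",
    "cook rice water",
    "stock market news",
]

# max_features keeps only the terms with the highest counts across the corpus.
vectorizer = TfidfVectorizer(max_features=4)
matrix = vectorizer.fit_transform(docs).toarray()

print(sorted(vectorizer.vocabulary_))  # the retained terms
print(matrix.shape)                    # (number of documents, number of retained terms)
```

Each row is one document and each column one retained term, so with `max_features=1000` the notebook's `tf_idf` array has at most 1000 columns. Terms that fall outside the limit are simply dropped, which can leave a document with an all-zero row.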

### URL attribute

In [12]:
# same cleaning as for the body text: keep letters only, lowercase, split into words
df['url_new'] = df['url'].apply(lambda x: re.sub('[^a-zA-Z]', ' ', str(x)))
df['url_new'] = df['url_new'].apply(lambda x: str(x).lower().split())

In [13]:
df['url_new'] = df['url_new'].apply(lambda x: [porter_stemmer.stem(word) for word in x if word not in stopwords])

In [14]:
df['url_final'] = df['url_new'].apply(lambda x: ' '.join(x))

In [15]:
tv_url = TfidfVectorizer(max_features=1000)
tf_idf_url = tv_url.fit_transform(df['url_final']).toarray()

## Numeric attributes

In [16]:
num_feature_set = ['avglinksize', 'commonlinkratio_1', 'commonlinkratio_2', 'commonlinkratio_3', 'commonlinkratio_4',
                   'hasDomainLink','lengthyLinkDomain','linkwordscore','numberOfLinks',
                   'numwords_in_url', 'parametrizedLinkRatio']

## Benchmark model

We train a benchmark model on all attributes. We will then try to match its results using co-training with as little labeled data as possible.

In [17]:
y = df.label

In [18]:
df1 = df.loc[:, num_feature_set]

In [19]:
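# join_axes was removed in pandas 1.0: align on df1.index instead and prefix the TF-IDF columns so names stay unique strings
body_tfidf = pd.DataFrame(tf_idf, index=df1.index).add_prefix('body_tfidf_')
url_tfidf = pd.DataFrame(tf_idf_url, index=df1.index).add_prefix('url_tfidf_')
X = pd.concat([df1, body_tfidf, url_tfidf], axis=1)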
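
Column-wise concatenation matches rows by index label, not by position, so the feature blocks must share an index. The sketch below shows what can go wrong when they do not. The toy frames and the non-contiguous index (as it might look after filtering rows) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Numeric block whose index has gaps, e.g. after some rows were filtered out.
numeric = pd.DataFrame({"numberOfLinks": [10, 25, 7]}, index=[0, 2, 5])
tfidf_block = np.array([[0.0, 0.7], [0.5, 0.5], [1.0, 0.0]])

# A fresh DataFrame gets the default index 0..2, so the outer join adds NaN rows.
misaligned = pd.concat([numeric, pd.DataFrame(tfidf_block)], axis=1)
# Reusing the index of the numeric block keeps one row per sample.
aligned = pd.concat([numeric, pd.DataFrame(tfidf_block, index=numeric.index)], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

In this notebook `df` still carries its default index when the features are combined, so the blocks line up; the pitfall only appears if rows are dropped or shuffled beforehand.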

In [20]:
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
benchmark_clf = RandomForestClassifier(n_estimators=500, max_depth=20,
                                       random_state=10)

In [22]:
benchmark_clf.fit(X_train, y_train)
y_pred = benchmark_clf.predict(X_test)

In [23]:
confusion_matrix(y_test, y_pred)

array([[655,  75],
       [218, 531]], dtype=int64)

In [24]:
accuracy_score(y_test, y_pred)

0.801893171061528
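
The accuracy can be recomputed directly from the confusion matrix reported above, which also gives precision and recall for class 1. This is only an arithmetic check on the printed numbers, assuming scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Confusion matrix copied from the output above: rows = true labels, columns = predictions.
cm = [[655, 75],
      [218, 531]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of correct predictions
precision = tp / (tp + fp)     # of the samples predicted as class 1, how many are class 1
recall = tp / (tp + fn)        # of the true class 1 samples, how many were found

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The 218 false negatives against 75 false positives show that the benchmark misses class 1 samples more often than it raises false alarms, which accuracy alone does not reveal.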