# Co-training

**Authors:** Peter Macinec, Lukas Janik, Vajk Pomichal, Frantisek Sefcik

## Descriptive analysis

### Basic setup and library imports

In [1]:
import pandas as pd
import numpy as np


# plots
import matplotlib.pyplot as plt
import seaborn as sns

### Loading the dataset

Our data is available in two files, *train.tsv* and *test.tsv*. We load both and run a basic analysis on them.

In [2]:
# training data
df = pd.read_csv('data/train.tsv', sep='\t')

In [3]:
# test data
df_t = pd.read_csv('data/test.tsv', sep='\t')

### Basic analysis of the datasets

To start, let us check how many records each of the two datasets contains. First the training data:

In [4]:
len(df)

7395

And the test data:

In [5]:
len(df_t)

3171

Let us take a look at the data:

In [6]:
df.head()

|  | url | urlid | boilerplate | alchemy_category | alchemy_category_score | avglinksize | commonlinkratio_1 | commonlinkratio_2 | commonlinkratio_3 | commonlinkratio_4 | ... | is_news | lengthyLinkDomain | linkwordscore | news_front_page | non_markup_alphanum_characters | numberOfLinks | numwords_in_url | parametrizedLinkRatio | spelling_errors_ratio | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://www.bloomberg.com/news/2010-12-23/ibm-p... | 4042 | {"title":"IBM Sees Holographic Calls Air Breat... | business | 0.789131 | 2.055556 | 0.676471 | 0.205882 | 0.047059 | 0.023529 | ... | 1 | 1 | 24 | 0 | 5424 | 170 | 8 | 0.152941 | 0.07913 | 0 |
| 1 | http://www.popsci.com/technology/article/2012-... | 8471 | {"title":"The Fully Electronic Futuristic Star... | recreation | 0.574147 | 3.677966 | 0.508021 | 0.28877 | 0.213904 | 0.144385 | ... | 1 | 1 | 40 | 0 | 4973 | 187 | 9 | 0.181818 | 0.125448 | 1 |
| 2 | http://www.menshealth.com/health/flu-fighting-... | 1164 | {"title":"Fruits that Fight the Flu fruits tha... | health | 0.996526 | 2.382883 | 0.562016 | 0.321705 | 0.120155 | 0.042636 | ... | 1 | 1 | 55 | 0 | 2240 | 258 | 11 | 0.166667 | 0.057613 | 1 |
| 3 | http://www.dumblittleman.com/2007/12/10-foolpr... | 6684 | {"title":"10 Foolproof Tips for Better Sleep "... | health | 0.801248 | 1.543103 | 0.4 | 0.1 | 0.016667 | 0.0 | ... | 1 | 0 | 24 | 0 | 2737 | 120 | 5 | 0.041667 | 0.100858 | 1 |
| 4 | http://bleacherreport.com/articles/1205138-the... | 9006 | {"title":"The 50 Coolest Jerseys You Didn t Kn... | sports | 0.719157 | 2.676471 | 0.5 | 0.222222 | 0.123457 | 0.04321 | ... | 1 | 1 | 14 | 0 | 12032 | 162 | 10 | 0.098765 | 0.082569 | 0 |


We can see that the data contains various attributes with different data types: numerical, categorical, and some stored as objects. Let us look at the data types of all attributes:

In [7]:
df.dtypes

url                                object
urlid                               int64
boilerplate                        object
alchemy_category                   object
alchemy_category_score             object
avglinksize                       float64
commonlinkratio_1                 float64
commonlinkratio_2                 float64
commonlinkratio_3                 float64
commonlinkratio_4                 float64
compression_ratio                 float64
embed_ratio                       float64
framebased                          int64
frameTagRatio                     float64
hasDomainLink                       int64
html_ratio                        float64
image_ratio                       float64
is_news                            object
lengthyLinkDomain                   int64
linkwordscore                       int64
news_front_page                    object
non_markup_alphanum_characters      int64
numberOfLinks                       int64
numwords_in_url                   

The number of missing values is also of interest, because it helps us decide whether individual attributes are relevant for us:

In [8]:
df.isnull().values.any()

False

It appears that the data contains no missing values. However, some attributes may encode missing values with a placeholder value, which `isnull()` does not detect. We need to watch out for this case. If it occurs, we should uncover it in the exploratory analysis, where we look at the individual attributes more closely.

Let us also look briefly at all attributes and their aggregate statistics. First, we describe the numerical attributes:
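A quick way to probe for placeholder-encoded missing values is to count, per column, the cells that equal a candidate placeholder. The sketch below is only an illustration: the helper `count_placeholders`, the candidate list, and the toy data frame are all assumptions and not part of the original notebook.

```python
import pandas as pd


def count_placeholders(frame, placeholders=("?", "", "unknown")):
    """Count, per column, the cells equal to any candidate placeholder."""
    # isin flags cells matching a candidate; sum counts the flags per column
    counts = frame.isin(list(placeholders)).sum()
    # keep only the columns where at least one placeholder was found
    return counts[counts > 0]


# toy data standing in for df (hypothetical values)
toy = pd.DataFrame({
    "category": ["business", "?", "health", "?"],
    "score": ["0.78", "?", "0.99", "0.5"],
    "links": [170, 187, 258, 120],
})

placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```

On the real data the call would be `count_placeholders(df)`, with the candidate list adjusted to whatever placeholders turn up in the exploratory analysis.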

In [9]:
df.describe()

|  | urlid | avglinksize | commonlinkratio_1 | commonlinkratio_2 | commonlinkratio_3 | commonlinkratio_4 | compression_ratio | embed_ratio | framebased | frameTagRatio | ... | html_ratio | image_ratio | lengthyLinkDomain | linkwordscore | non_markup_alphanum_characters | numberOfLinks | numwords_in_url | parametrizedLinkRatio | spelling_errors_ratio | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | ... | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 | 7395.0 |
| mean | 5305.704665 | 2.761823 | 0.46823 | 0.21408 | 0.092062 | 0.049262 | 2.255103 | -0.10375 | 0.0 | 0.056423 | ... | 0.233778 | 0.275709 | 0.660311 | 30.077079 | 5716.598242 | 178.754564 | 4.960649 | 0.172864 | 0.101221 | 0.51332 |
| std | 3048.384114 | 8.619793 | 0.203133 | 0.146743 | 0.095978 | 0.072629 | 5.704313 | 0.306545 | 0.0 | 0.041446 | ... | 0.052487 | 1.91932 | 0.473636 | 20.393101 | 8875.43243 | 179.466198 | 3.233111 | 0.183286 | 0.079231 | 0.499856 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | ... | 0.045564 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 2688.5 | 1.602062 | 0.34037 | 0.105263 | 0.022222 | 0.0 | 0.442616 | 0.0 | 0.0 | 0.028502 | ... | 0.201061 | 0.0259 | 0.0 | 14.0 | 1579.0 | 82.0 | 3.0 | 0.040984 | 0.068739 | 0.0 |
| 50% | 5304.0 | 2.088235 | 0.481481 | 0.202454 | 0.068627 | 0.022222 | 0.48368 | 0.0 | 0.0 | 0.045775 | ... | 0.230564 | 0.083051 | 1.0 | 25.0 | 3500.0 | 139.0 | 5.0 | 0.113402 | 0.089312 | 1.0 |
| 75% | 7946.5 | 2.627451 | 0.616604 | 0.3 | 0.133333 | 0.065065 | 0.578227 | 0.0 | 0.0 | 0.073459 | ... | 0.26077 | 0.2367 | 1.0 | 43.0 | 6377.0 | 222.0 | 7.0 | 0.241299 | 0.112376 | 1.0 |
| max | 10566.0 | 363.0 | 1.0 | 1.0 | 0.980392 | 0.980392 | 21.0 | 0.25 | 0.0 | 0.444444 | ... | 0.716883 | 113.333333 | 1.0 | 100.0 | 207952.0 | 4997.0 | 22.0 | 1.0 | 1.0 | 1.0 |


Next, the categorical attributes:

In [10]:
df.describe(exclude=[np.number])

|  | url | boilerplate | alchemy_category | alchemy_category_score | is_news | news_front_page |
|---|---|---|---|---|---|---|
| count | 7395 | 7395 | 7395 | 7395 | 7395 | 7395 |
| unique | 7395 | 7394 | 14 | 4806 | 2 | 3 |
| top | http://www.goaskalice.columbia.edu/1585.html | {"title":"Freebase Pancakes NOTCOT ","body":"n... | ? | ? | 1 | 0 |
| freq | 1 | 2 | 2342 | 2342 | 4552 | 5853 |


From this description we can see that the data apparently does contain missing values, marked with a question mark. There may also be duplicates (for *boilerplate*, the top value has a frequency of 2). We also see that not all categorical attributes are truly categorical. For example, *alchemy_category_score* is apparently a numerical attribute whose missing values have been replaced with a question mark. The opposite problem may have occurred among the numerical attributes: some of them may actually represent categories. This still needs to be examined in the exploratory analysis and, if necessary, fixed in the data preprocessing phase.
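The three checks mentioned above (question-mark placeholders, duplicated *boilerplate* values, and numerical attributes that may really be categories) could be sketched roughly as follows. This is a minimal illustration on a toy data frame with made-up rows. The cardinality threshold of 2 is an arbitrary assumption, and the actual preprocessing may well differ.

```python
import numpy as np
import pandas as pd

# toy data standing in for df (hypothetical rows, real column names)
toy = pd.DataFrame({
    "boilerplate": ["a", "b", "b", "c", "d"],
    "alchemy_category": ["business", "?", "health", "?", "sports"],
    "alchemy_category_score": ["0.789131", "?", "0.996526", "?", "0.719157"],
    "is_news": ["1", "?", "1", "1", "?"],
    "framebased": [0, 0, 0, 0, 0],
    "lengthyLinkDomain": [1, 1, 1, 0, 1],
    "numberOfLinks": [170, 187, 258, 120, 162],
})

# treat the "?" placeholder as a real missing value
cleaned = toy.replace("?", np.nan)

# convert the score to a numeric dtype; unparsable cells become NaN
cleaned["alchemy_category_score"] = pd.to_numeric(
    cleaned["alchemy_category_score"], errors="coerce"
)

# missing values per column are now visible to isnull()
missing_per_column = cleaned.isnull().sum()

# rows whose boilerplate text repeats an earlier row
duplicate_rows = cleaned[cleaned.duplicated(subset="boilerplate", keep="first")]

# numeric columns with very few distinct values may really be categories
# (threshold of 2 is an assumption chosen for this illustration)
numeric = cleaned.select_dtypes(include=[np.number])
low_cardinality = [c for c in numeric.columns if numeric[c].nunique() <= 2]

print(missing_per_column)
print(duplicate_rows)
print(low_cardinality)
```

Running the same steps on `df` instead of `toy` would show how many values each attribute is really missing and which numerical attributes deserve a closer look as potential categories.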

## Exploratory analysis

In the exploratory analysis we look at the individual attributes in more depth. We try to find attributes that could help with classification.