### Analyzing different features in the data set

This notebook analyzes several features of the data set, such as trigrams, URL length, and the number of digits.
All features are combined into a single matrix.
Although several features are analyzed, only the URL trigrams were chosen for use. Note that the `CountVectorizer` cell in this notebook is configured with `ngram_range=(2,2)`, which produces character bigrams rather than trigrams.
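As a rough illustration of the character n-gram feature, the sketch below extracts overlapping n-grams from a normalized URL string using only the standard library. The helper name `char_ngrams` and the sample string are assumptions made for this example and are not part of the notebook.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of `text` (hypothetical helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A URL with scheme, "www." and special characters already removed.
sample = 'galeoncomkmh'

trigrams = char_ngrams(sample, n=3)
counts = Counter(trigrams)  # n-gram -> number of occurrences

print(trigrams[:4])   # ['gal', 'ale', 'leo', 'eon']
print(len(trigrams))  # 10
```

A string of length `L` yields `L - n + 1` n-grams, so strings shorter than `n` contribute none.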

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
import nltk

%matplotlib inline

In [2]:
random_state = 47
np.random.seed(seed=random_state)

In [3]:
df = pd.read_csv('./dmoz.csv', header=None, names=['url', 'category'])

df = df.dropna()

print('Data set len: ', len(df))

print(df.head())

Data set len:  1562975
                                                 url category
1                   http://www.liquidgeneration.com/    Adult
2                        http://www.onlineanime.org/    Adult
3  http://www.ceres.dti.ne.jp/~nekoi/senno/senfir...    Adult
4                         http://www.galeon.com/kmh/    Adult
5                        http://www.fanworkrecs.com/    Adult


In [4]:
dict_cat = {
    'Adult': 0,
    'Arts': 1,
    'Business': 2,
    'Computers': 3,
    'Games': 4,
    'Health': 5,
    'Home': 6,
    'Kids': 7,
    'News': 8,
    'Recreation': 9,
    'Reference': 10,
    'Science': 11,
    'Shopping': 12,
    'Society': 13,
    'Sports': 14
}

def to_category_id(item):
    return dict_cat[item]

In [5]:
df['cat_id'] = df['category'].apply(to_category_id)
print(df.head())

                                                 url category  cat_id
1                   http://www.liquidgeneration.com/    Adult       0
2                        http://www.onlineanime.org/    Adult       0
3  http://www.ceres.dti.ne.jp/~nekoi/senno/senfir...    Adult       0
4                         http://www.galeon.com/kmh/    Adult       0
5                        http://www.fanworkrecs.com/    Adult       0


### Helper functions

In [6]:
def length_char(item):
    return len(remove_special_char(item))

In [7]:
def remove_special_char(item):
    return re.sub(r'\W+', '', item)

In [8]:
regex_replace_http = r'(www[0-9][.])'
def replace_http(item):
    item_r = item.replace('http://', '').replace('https://', '').replace('www.', '')
    return re.sub(regex_replace_http, '', item_r)
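Because `replace_http` calls `.replace('www.', '')` on the whole string, a `www.` that appears later in the path is removed as well. The sketch below shows one possible alternative that strips only a leading scheme and a leading `www`/`wwwN` label. The names `_PREFIX` and `strip_prefix`, and the `example.com` URL, are hypothetical and not part of the notebook, and this variant does not reproduce `replace_http` exactly.

```python
import re

# Optional scheme followed by an optional "www." / "www2." style label,
# anchored at the start of the string so later occurrences are untouched.
_PREFIX = re.compile(r'^(?:https?://)?(?:www\d*\.)?', flags=re.IGNORECASE)

def strip_prefix(url):
    """Remove a leading scheme and www label from `url` (hypothetical helper)."""
    return _PREFIX.sub('', url)

print(strip_prefix('http://www.galeon.com/kmh/'))
print(strip_prefix('http://www2.117.ne.jp/~mb1996ax/enadc.html'))
print(strip_prefix('http://example.com/www.page'))
```

The first two calls give the same results as the notebook's `n_url` column for those URLs. The third keeps `www.page` in the path, which `replace_http` would shorten to `page`.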

In [9]:
def qt_number(item):
    return len(''.join(re.findall(r'[0-9]+', item)))

In [10]:
# Separator characters split the URL only when followed by a word character.
regex_split = r'[./=~?&+\'\"_;-]+(?=\w)'
def qt_tokens(item):
    return len(re.split(regex_split, item))

In [11]:
regex_split = r'[./=~?&+\'\"_;-]+(?=\w)'
def average_len_tokens(item):
    # Mean token length, counting only word characters in each token.
    tokens = re.split(regex_split, item)
    return sum(len(remove_special_char(token)) for token in tokens) / len(tokens)
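To make the token features concrete, here is a small worked example on one URL from the data set. It reuses the same separator class as `regex_split`. The names `SEPARATORS` and `tokens_of` are illustrative and do not appear in the notebook.

```python
import re

# Runs of separator characters, matched only when a word character follows,
# so a trailing "/" stays attached to the last token.
SEPARATORS = re.compile(r'[./=~?&+\'"_;-]+(?=\w)')

def tokens_of(url):
    """Split a scheme-less URL into tokens (illustrative helper)."""
    return SEPARATORS.split(url)

sample = 'ceres.dti.ne.jp/~nekoi/senno/senfirst.html'
tokens = tokens_of(sample)

# Token lengths are measured after dropping non-word characters.
lengths = [len(re.sub(r'\W+', '', t)) for t in tokens]

print(tokens)
print(len(tokens), sum(lengths) / len(tokens))  # 8 4.25
```

The 8 tokens and mean length of 4.25 match the `qt_tokens` and `a_tokens` values the notebook reports for this URL.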

In [12]:
def get_hostname_len(item):
    return len(remove_special_char(item.split('/')[0]))
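`get_hostname_len` takes everything before the first `/` as the hostname, which would include a port number if one were present. As a hedged alternative, the standard library's `urllib.parse.urlsplit` can isolate the host from an absolute URL. The helper name `hostname_of` and the `example.com` URL are assumptions for this sketch. It expects the original `url` column, with the scheme still present, rather than `n_url`.

```python
from urllib.parse import urlsplit

def hostname_of(url):
    """Return the host of an absolute URL, without port or path (illustrative helper)."""
    # urlsplit only recognises the host when the URL contains "//".
    return urlsplit(url).hostname or ''

print(hostname_of('http://archive.rhps.org/fritters/yui/index.html'))
print(hostname_of('http://example.com:8080/index.html'))
```

Unlike the notebook's helper, this keeps any `www.` label, so it would need to be combined with the prefix stripping step to give comparable lengths.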

In [13]:
def only_char(item):
    return re.sub('[^A-Za-z]+', '', item)

In [14]:
df['n_url'] = df['url'].apply(replace_http)

In [15]:
df['norm_url'] = df['n_url'].apply(remove_special_char)
df.head()

|   | url | category | cat_id | n_url | norm_url |
|---|-----|----------|--------|-------|----------|
| 1 | http://www.liquidgeneration.com/ | Adult | 0 | liquidgeneration.com/ | liquidgenerationcom |
| 2 | http://www.onlineanime.org/ | Adult | 0 | onlineanime.org/ | onlineanimeorg |
| 3 | http://www.ceres.dti.ne.jp/~nekoi/senno/senfir... | Adult | 0 | ceres.dti.ne.jp/~nekoi/senno/senfirst.html | ceresdtinejpnekoisennosenfirsthtml |
| 4 | http://www.galeon.com/kmh/ | Adult | 0 | galeon.com/kmh/ | galeoncomkmh |
| 5 | http://www.fanworkrecs.com/ | Adult | 0 | fanworkrecs.com/ | fanworkrecscom |


In [16]:
df['url_text'] = df['n_url'].apply(only_char)

In [17]:
df.head()

|   | url | category | cat_id | n_url | norm_url | url_text |
|---|-----|----------|--------|-------|----------|----------|
| 1 | http://www.liquidgeneration.com/ | Adult | 0 | liquidgeneration.com/ | liquidgenerationcom | liquidgenerationcom |
| 2 | http://www.onlineanime.org/ | Adult | 0 | onlineanime.org/ | onlineanimeorg | onlineanimeorg |
| 3 | http://www.ceres.dti.ne.jp/~nekoi/senno/senfir... | Adult | 0 | ceres.dti.ne.jp/~nekoi/senno/senfirst.html | ceresdtinejpnekoisennosenfirsthtml | ceresdtinejpnekoisennosenfirsthtml |
| 4 | http://www.galeon.com/kmh/ | Adult | 0 | galeon.com/kmh/ | galeoncomkmh | galeoncomkmh |
| 5 | http://www.fanworkrecs.com/ | Adult | 0 | fanworkrecs.com/ | fanworkrecscom | fanworkrecscom |


### URL length without special characters

In [18]:
df['length'] = df['n_url'].apply(length_char)

### Number of digits in the URL

In [19]:
df['qt_number'] = df['n_url'].apply(qt_number)

### Number of tokens

In [20]:
df['qt_tokens'] = df['n_url'].apply(qt_tokens)

### Average token length

In [21]:
df['a_tokens'] = df['n_url'].apply(average_len_tokens)

### Hostname length

In [22]:
df['hostname_len'] = df['n_url'].apply(get_hostname_len)

In [23]:
print(df.head(10))

                                                  url category  cat_id  \
1                    http://www.liquidgeneration.com/    Adult       0   
2                         http://www.onlineanime.org/    Adult       0   
3   http://www.ceres.dti.ne.jp/~nekoi/senno/senfir...    Adult       0   
4                          http://www.galeon.com/kmh/    Adult       0   
5                         http://www.fanworkrecs.com/    Adult       0   
6                          http://www.animehouse.com/    Adult       0   
7          http://www2.117.ne.jp/~mb1996ax/enadc.html    Adult       0   
8     http://archive.rhps.org/fritters/yui/index.html    Adult       0   
9                      http://www.freecartoonsex.com/    Adult       0   
10                            http://www.cutepet.org/    Adult       0   

                                         n_url  \
1                        liquidgeneration.com/   
2                             onlineanime.org/   
3   ceres.dti.ne.jp/~nekoi/senno/se

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfTransformer

In [25]:
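# Count character bigrams (ngram_range=(2, 2)) in each normalized URL.
count_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2)).fit(df['norm_url'])
words_vector = count_vectorizer.transform(df['norm_url'])
# With norm=None and use_idf=False the transformer keeps the raw counts (as floats).
tf_transformer = TfidfTransformer(norm=None, use_idf=False).fit(words_vector)
urls_tf = tf_transformer.transform(words_vector)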

In [26]:
print('TF shape: ', urls_tf.shape)

TF shape:  (1562975, 1375)
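The second dimension of the TF shape is the number of distinct character bigrams found across all URLs. The pure-Python sketch below mimics that counting step on two toy strings. It is only an approximation of what `CountVectorizer` computes, since it ignores the preprocessing the real class applies by default, and the names `char_bigrams`, `docs`, `vocabulary`, and `matrix` are assumptions for this example.

```python
def char_bigrams(text):
    """Return the overlapping character bigrams of `text` (hypothetical helper)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

docs = ['abab', 'abc']  # toy stand-ins for norm_url values

# One column per distinct bigram, in sorted order.
vocabulary = sorted({bg for doc in docs for bg in char_bigrams(doc)})
column = {bg: j for j, bg in enumerate(vocabulary)}

# Dense count matrix: one row per document, one column per bigram.
matrix = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for bg in char_bigrams(doc):
        matrix[i][column[bg]] += 1

print(vocabulary)                       # ['ab', 'ba', 'bc']
print((len(matrix), len(vocabulary)))   # (2, 3)
print(matrix)                           # [[2, 1, 0], [1, 0, 1]]
```

The real matrix is stored in sparse form because most of the bigram columns are zero for any given URL.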


In [27]:
# .ix and .as_matrix() were removed in pandas 1.0; iloc and to_numpy() replace them.
data_set = df.iloc[:, 6:].to_numpy()
print('New features shape: ', data_set.shape)
print('Data set with new features: ', data_set)

New features shape:  (1562975, 5)
Data set with new features:  [[ 19.      0.      2.      9.5    19.   ]
 [ 14.      0.      2.      7.     14.   ]
 [ 34.      0.      8.      4.25   12.   ]
 ..., 
 [ 33.      0.      4.      8.25   23.   ]
 [ 34.      0.      8.      4.125   9.   ]
 [ 22.      0.      4.      5.5    11.   ]]


In [28]:
import scipy as sp
import scipy.sparse  # make sure the sparse submodule is loaded

In [29]:
# Append the five dense feature columns to the sparse bigram counts.
urls_tf = sp.sparse.hstack((urls_tf, data_set), format='csr')

In [30]:
print('Data set shape: ', urls_tf.shape)
print('First item: ', urls_tf.getrow(0).toarray())

Data set shape:  (1562975, 1380)
First item:  [[  0.    0.    0.  ...,   2.    9.5  19. ]]
