# Handling scraped data from the Scikit-learn user guide

A study of the data scraped from the Scikit-learn user guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

I load the data from a pickle file previously produced by a Python scraping script.

In [228]:
import pickle
import numpy as np

with open('sklearn_guide.plk','rb') as rick:
    df_guide= pickle.load(rick)
df_guide.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   level0   649 non-null    object
 1   level1   649 non-null    object
 2   content  649 non-null    object
 3   level2   592 non-null    object
 4   level3   344 non-null    object
 5   level4   58 non-null     object
dtypes: object(6)
memory usage: 30.5+ KB
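
As a side note on the loading step, here is a minimal sketch of a pickle round trip for a DataFrame. The two-row frame and the file name `guide_demo.pkl` are hypothetical stand-ins, not the scraped data. It shows that `pickle.load` and `pd.read_pickle` return the same frame, and that missing sub-headings survive as nulls, which is what the non-null counts above reflect. Pickle files can execute code when loaded, so this only makes sense for files from a trusted source.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical two-row frame mimicking the scraped layout (not the real data).
df = pd.DataFrame({
    "level0": ["Supervised learning", "Supervised learning"],
    "level1": ["Linear Models", "Linear Models"],
    "content": ["first paragraph", "second paragraph"],
    "level2": ["Ordinary Least Squares", None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "guide_demo.pkl")
    df.to_pickle(path)                 # what a scraper could do as its last step
    with open(path, "rb") as fh:
        via_pickle = pickle.load(fh)   # same call as in the cell above
    via_pandas = pd.read_pickle(path)  # pandas helper for the same file

# Rows without a level2 heading come back as nulls.
missing_level2 = int(via_pandas["level2"].isna().sum())
print(via_pickle.equals(via_pandas), missing_level2)
```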


In [229]:
df_guide.head()

|   | level0 | level1 | content | level2 | level3 | level4 |
|---|--------|--------|---------|--------|--------|--------|
| 0 | Supervised learning | Neural network models (supervised)¶ | The following are a set of methods intended fo... | | | |
| 1 | Supervised learning | Linear Models¶ | It is possible to constrain all the coefficien... | Ordinary Least Squares¶ | Non-Negative Least Squares¶ | |
| 2 | Supervised learning | Linear Models¶ | The least squares solution is computed using t... | Ordinary Least Squares¶ | Ordinary Least Squares Complexity¶ | |
| 3 | Supervised learning | Linear Models¶ | LinearRegression fits a linear model with coef... | Ordinary Least Squares¶ | | |
| 4 | Supervised learning | Linear Models¶ | Ridge regression addresses some of the problem... | Ridge regression and classification¶ | Regression¶ | |


## Preprocessing with spaCy

In [6]:
#!python -m spacy download en_core_web_sm

In [217]:
import spacy
nlp = spacy.load("en_core_web_sm")
df_guide.shape

(649, 5)

In [231]:

vocabulary = []
vocabulary_lemma=[]
level1_class = []
level2_class = []

for i,row in df_guide.iterrows():
    # Heading and content are concatenated without a separator, so the last heading
    # word appears to fuse with the first content word (e.g. 'Models¶It') and is
    # then dropped by the is_alpha filter
    text = row.loc['level1'] + row.loc['content']
    doc = nlp(text)  # tokenize and lemmatize with spaCy
    content = []
    content_lemma = []
    for token in doc:
        # drop punctuation and stop words; keep alphabetic and numeric tokens
        if (token.is_alpha or token.is_digit) and not token.is_stop:
            content.append(token.text.lower())
            content_lemma.append(token.lemma_.lower())

    vocabulary.append(' '.join(content))
    vocabulary_lemma.append(' '.join(content_lemma))

    # Naming is offset by one: level1_class holds the 'level0' column and
    # level2_class holds 'level1', which is also part of the text built above
    level1_class.append(row['level0'])
    level2_class.append(row['level1'].replace('¶',''))
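
To illustrate the token filter, here is a simplified stand-in that does not use spaCy: it splits on whitespace, uses `str.isalpha` / `str.isdigit` in place of `token.is_alpha` / `token.is_digit`, and relies on a tiny hypothetical stop-word list. It is only a sketch of the filtering idea, and it shows how joining heading and content without a space can lose the last heading word.

```python
# Simplified stand-in for the spaCy filter (assumption: whitespace tokens and a
# tiny hypothetical stop-word list; spaCy's tokenizer and stop list differ).
STOP_WORDS = {"it", "is", "to", "all", "the"}

def keep_tokens(text):
    """Keep lowercase alphabetic or numeric tokens that are not stop words."""
    return [
        tok.lower()
        for tok in text.split()
        if (tok.isalpha() or tok.isdigit()) and tok.lower() not in STOP_WORDS
    ]

heading = "Linear Models¶"
content = "It is possible to constrain all the coefficients"

# Direct concatenation yields the token 'Models¶It', which is not alphabetic.
fused = keep_tokens(heading + content)

# Stripping the pilcrow and adding a space keeps 'models'.
separated = keep_tokens(heading.replace("¶", "") + " " + content)

print(fused)
print(separated)
```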



In [232]:
print(vocabulary[1])
print(vocabulary_lemma[1])
print(level1_class[1])
print(len(set(level2_class)))
    

linear possible constrain coefficients non negative useful represent physical naturally non negative quantities frequency counts prices goods linearregression accepts boolean positive parameter set true non negative squares applied example
linear possible constrain coefficient non negative useful represent physical naturally non negative quantity frequency count price good linearregression accept boolean positive parameter set true non negative squares apply example
Supervised learning
49


In [233]:
from sklearn.feature_extraction.text import CountVectorizer


cv = CountVectorizer(binary = True,
                    ngram_range=(1,3))

X_count = cv.fit_transform(vocabulary)
# Re-fitting the same vectorizer: from here on cv only holds the lemma vocabulary
X_count_lemma = cv.fit_transform(vocabulary_lemma)
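
Because `fit_transform` overwrites the fitted vocabulary, one option is to keep one vectorizer per corpus. A minimal sketch, assuming two tiny hypothetical corpora and the names `cv_raw` and `cv_lemma` (neither is part of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature corpora standing in for vocabulary / vocabulary_lemma.
raw_docs = ["non negative coefficients", "frequency counts"]
lemma_docs = ["non negative coefficient", "frequency count"]

# One vectorizer per corpus, so each keeps its own fitted vocabulary.
cv_raw = CountVectorizer(binary=True, ngram_range=(1, 3))
cv_lemma = CountVectorizer(binary=True, ngram_range=(1, 3))

X_raw = cv_raw.fit_transform(raw_docs)
X_lemma = cv_lemma.fit_transform(lemma_docs)

# The surface form exists only in the raw vocabulary.
print("coefficients" in cv_raw.vocabulary_, "coefficients" in cv_lemma.vocabulary_)
print(X_raw.shape)
```

With this layout, new text can later be transformed with whichever vectorizer matches the matrix a model was trained on.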




In [234]:
from sklearn.feature_extraction.text import TfidfVectorizer

ctfid = TfidfVectorizer(
                    ngram_range=(1,3))

X_tfid = ctfid.fit_transform(vocabulary)
# ctfid is re-fitted here and ends up fitted on the lemmatized corpus
X_tfid_lemma = ctfid.fit_transform(vocabulary_lemma)

In [235]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder_l1 = LabelEncoder()
lbl_encoder_l2 = LabelEncoder()

y_level1 = lbl_encoder_l1.fit_transform(level1_class)
y_level2 = lbl_encoder_l2.fit_transform(level2_class)
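
A small sketch of what the encoders do, using three hypothetical labels: `LabelEncoder` sorts the distinct labels, maps each to an integer, and `inverse_transform` maps predictions back to the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for level1_class.
labels = ["Supervised learning", "Unsupervised learning", "Supervised learning"]

encoder = LabelEncoder()
y = encoder.fit_transform(labels)        # integer codes, one per label
decoded = encoder.inverse_transform(y)   # back to the original strings

print(list(encoder.classes_))
print(list(y), list(decoded))
```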


In [236]:
from sklearn.model_selection import train_test_split


X_train_l1, X_test_l1, y_train_l1, y_test_l1 = train_test_split(X_tfid_lemma,y_level1,random_state=42,shuffle = True)

X_train_l2, X_test_l2, y_train_l2, y_test_l2 = train_test_split(X_tfid_lemma,y_level2,random_state=42,shuffle = True)
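
The split above is applied to a matrix that was vectorized on all rows, so the TF-IDF vocabulary and IDF weights have already seen the test rows. A minimal sketch of one alternative, on hypothetical toy texts: split the raw strings first and let a `Pipeline` fit the vectorizer on the training part only. The `stratify` argument and the toy labels are assumptions added for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the first word of each text is unique to it.
texts = [
    "ridge regression shrinks coefficients",
    "lasso regression yields sparse coefficients",
    "logistic regression predicts class probabilities",
    "support vector machines maximise margin",
    "kmeans clustering groups similar points",
    "dbscan clustering finds dense regions",
    "pca projects data onto components",
    "gaussian mixture estimates density",
]
labels = ["supervised"] * 4 + ["unsupervised"] * 4

# Split the raw strings, not an already vectorized matrix.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The vectorizer is fitted inside the pipeline, on training texts only.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LinearSVC())
model.fit(X_train, y_train)

vocab = model.named_steps["tfidfvectorizer"].vocabulary_
leaked = [t.split()[0] for t in X_test if t.split()[0] in vocab]
predictions = model.predict(X_test)
print(leaked, list(predictions))
```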

Level2:

X_count -> 0.4969

X_count_lemma -> 0.5030

X_tfid -> 0.5030

X_tfid_lemma -> 0.5153

Level1:

X_tfid_lemma -> 0.7975

## Model search

In [240]:
from sklearn.model_selection import GridSearchCV

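def train_model(model,train,target):
    # Grid-search the regularization strength C with 5-fold cross-validation
    params = {'C':[0.01,0.05,0.25,0.5,1]}

    grid = GridSearchCV(model,params,cv=5)
    grid.fit(train,target)

    # Best estimator, refit on all of `train` (GridSearchCV's default refit=True)
    return grid.best_estimator_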
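
`train_model` returns only the best estimator. A short sketch of what else the fitted search exposes, on synthetic data from `make_classification` (an assumption for the example, as is `max_iter=10000`; the notebook uses the TF-IDF matrices and the default iteration limit):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic stand-in for the TF-IDF features and encoded labels.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

params = {"C": [0.01, 0.05, 0.25, 0.5, 1]}
grid = GridSearchCV(LinearSVC(max_iter=10000), params, cv=5)
grid.fit(X, y)

# Selected C and its mean cross-validated accuracy.
print(grid.best_params_, round(grid.best_score_, 3))

# One mean test score per candidate value of C.
mean_scores = grid.cv_results_["mean_test_score"]
print(len(mean_scores))
```

Looking at `best_params_` helps spot when the chosen `C` sits at the edge of the grid, which would suggest widening the range.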

In [238]:

from sklearn.svm import LinearSVC

svm_l1 = LinearSVC()

best_svm_l1= train_model(svm_l1,X_train_l1,y_train_l1)

predict = best_svm_l1.predict(X_test_l1)

print('l1',best_svm_l1.score(X_test_l1,y_test_l1))

svm_l2 = LinearSVC()

best_svm_l2= train_model(svm_l2,X_train_l2,y_train_l2)

predict = best_svm_l2.predict(X_test_l2)

print('l2',best_svm_l2.score(X_test_l2,y_test_l2))

l1 0.852760736196319
l2 0.6932515337423313


I train the model on all the data.

In [245]:
from sklearn.svm import LinearSVC

svm_l1 = LinearSVC()

best_svm_l1= train_model(svm_l1,X_tfid_lemma,y_level1)

svm_l2 = LinearSVC()

best_svm_l2= train_model(svm_l2,X_tfid_lemma,y_level2)



In [250]:
def cleanPipeline(text, vectorizer,lemma=True):
    doc = nlp(text)  # tokenize and lemmatize with spaCy
    content = []
    for token in doc:
        # drop punctuation and stop words; keep alphabetic and numeric tokens
        if (token.is_alpha or token.is_digit) and not token.is_stop:
            if not lemma:
                content.append(token.text.lower())
            else:
                content.append(token.lemma_.lower())
    # join after the loop so resp is defined even when the input is empty
    resp = ' '.join(content)
    return vectorizer.transform([resp])

inp = input()
vector = cleanPipeline(inp,ctfid)
prediction_l1 = best_svm_l1.predict(vector)
tag_l1 = lbl_encoder_l1.inverse_transform(prediction_l1)
# l1 prepends the predicted level-1 tag to the input, but it is not used below:
# the level-2 model is queried with the same vector as the level-1 model
l1 = cleanPipeline(tag_l1[0] + ' ' + inp,ctfid)
prediction_l2 = best_svm_l2.predict(vector)
tag_l2 = lbl_encoder_l2.inverse_transform(prediction_l2)
print(tag_l1,tag_l2)

['Unsupervised learning'] ['Novelty and Outlier Detection']
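
The two-step prediction can be wrapped in a function. A minimal sketch on a hypothetical four-document corpus, with the spaCy cleaning step left out for brevity; the function name `predict_levels` and the empty-vector guard are assumptions, added because a query with no known terms becomes an all-zero vector and the classifiers would still return some label for it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data (not the scraped guide).
texts = [
    "ridge regression coefficients",
    "decision tree splits",
    "kmeans clustering centroids",
    "outlier detection isolation forest",
]
level1 = ["Supervised learning", "Supervised learning",
          "Unsupervised learning", "Unsupervised learning"]
level2 = ["Linear Models", "Decision Trees",
          "Clustering", "Novelty and Outlier Detection"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
clf_l1 = LinearSVC().fit(X, level1)  # classifiers accept string labels directly
clf_l2 = LinearSVC().fit(X, level2)

def predict_levels(text, vectorizer, clf_l1, clf_l2):
    """Return (level1, level2) labels, or None if no term of text is known."""
    vec = vectorizer.transform([text])
    if vec.nnz == 0:  # nothing in the query matches the fitted vocabulary
        return None
    return clf_l1.predict(vec)[0], clf_l2.predict(vec)[0]

known = predict_levels("outlier detection", vectorizer, clf_l1, clf_l2)
unknown = predict_levels("zzz qqq", vectorizer, clf_l1, clf_l2)
print(known, unknown)
```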
