Training the logistic regression model.

In [2]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

df = pd.read_csv('prepared_train.csv')
X = df.drop(['text_type'], axis=1)
# CountVectorizer cannot handle NaN, so missing texts get the placeholder token 'space'
X['text'] = X['text'].fillna('space')
y = df['text_type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
vectorizer = CountVectorizer()
scaler = StandardScaler()

# build a transformer that preprocesses the text and numeric columns in one step
preprocessor = ColumnTransformer(
    transformers=[
        ('text', vectorizer, 'text'),
        ('num', scaler, ['spam_symbols', 'not_spam_symbols', 'special_symbols',
                         'digits', 'text_len', 'words_count', 'emojis_count']),
    ]
)

# build the model training pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=5000)),
])

model.fit(X_train, y_train)

# predict the probability of the second class (model.classes_[1]) for the training and test sets
y_pred_train = model.predict_proba(X_train)[:, 1]
y_pred_test = model.predict_proba(X_test)[:, 1]
results = pd.DataFrame(
    {'Train: ': roc_auc_score(y_train, y_pred_train),
     'Test: ': roc_auc_score(y_test, y_pred_test)},
    index=['0'],
)
results

Unnamed: 0,Train:,Test:
0,0.999608,0.985825
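
The two numbers above are ROC AUC values. As a rough illustration of what the metric measures, the sketch below computes it as the share of (positive, negative) pairs in which the positive example gets the higher score. The helper `pairwise_auc` and the toy data are hypothetical, and the labels are assumed to be encoded as 0/1; this is not how `roc_auc_score` is implemented, only an equivalent definition on small inputs.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Assumes labels are 0/1. Tied scores count as half a correct pair.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError('both classes must be present')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data, made up for illustration only.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Read this way, a test value of 0.985825 means that a randomly chosen positive example outranks a randomly chosen negative one in roughly 98.6% of pairs.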


Loading the test file and predicting on it.

In [11]:
data = pd.read_csv('prepared_test.csv')
data['text'] = data['text'].fillna('space')
y_pred_final = model.predict_proba(data)[:, 1]
data['score'] = y_pred_final
data.head()

Unnamed: 0,text,spam_symbols,not_spam_symbols,special_symbols,digits,text_len,words_count,emojis_count,score
0,j jim whitehead ejw cse ucsc edu write j open ...,2,5,0,0,436,69,0,0.001849
1,origin messag bitbitch magnesium net peopl scr...,2,2,0,0,304,46,0,0.001729
2,java manag vinc durasoft taught java class gro...,2,9,0,7,435,75,0,0.010542
3,youtub name saiman say,0,0,0,0,22,4,0,0.021355
4,underpr issu high return equiti oil gas adviso...,1,4,0,7,512,84,0,0.98225
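
The `score` column holds the second column of `predict_proba`, which scikit-learn orders according to `classes_`. The sketch below rebuilds the same pipeline structure on a tiny made-up frame to show how the column order can be checked. The `toy_*` names, the reduced set of numeric columns, and the string labels `'ham'`/`'spam'` are assumptions for illustration; the actual values of `text_type` are not shown in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: one text column and two of the numeric columns.
toy = pd.DataFrame({
    'text': ['win free prize now', 'meeting at noon',
             'free prize click', 'see you at lunch'],
    'digits': [3, 0, 5, 0],
    'text_len': [18, 15, 16, 16],
})
toy_labels = ['spam', 'ham', 'spam', 'ham']  # assumed label encoding

toy_preprocessor = ColumnTransformer(transformers=[
    # a plain string selects a 1-D column, which is what CountVectorizer expects
    ('text', CountVectorizer(), 'text'),
    # a list selects a 2-D block, which is what StandardScaler expects
    ('num', StandardScaler(), ['digits', 'text_len']),
])

toy_model = Pipeline(steps=[
    ('preprocessor', toy_preprocessor),
    ('classifier', LogisticRegression(max_iter=5000)),
])
toy_model.fit(toy, toy_labels)

toy_classes = list(toy_model.classes_)   # classes are sorted, so 'ham' comes first
toy_proba = toy_model.predict_proba(toy)  # one column per entry of classes_
toy_scores = toy_proba[:, 1]              # probability of toy_classes[1]
print(toy_classes, toy_proba.shape)
```

Columns that the `ColumnTransformer` does not name are dropped by default, which is why a frame with extra columns can still be passed to `predict_proba` as long as the named ones are present.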


Saving the file in the required format.

In [13]:
data[['score', 'text']].to_csv('scored_test.csv', index=False)
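
A quick way to confirm the saved layout is to read the file back and inspect its columns. The sketch below does this with a made-up two-row frame written to a temporary directory; the file name `scored_toy.csv` and the `toy_scored` frame are hypothetical and stand in for the real output.

```python
import os
import tempfile

import pandas as pd

# Made-up scored frame standing in for the real predictions.
toy_scored = pd.DataFrame({
    'text': ['free prize', 'see you at lunch'],
    'score': [0.98, 0.02],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scored_toy.csv')
    # Same call shape as above: select columns in the desired order, drop the index.
    toy_scored[['score', 'text']].to_csv(path, index=False)
    reloaded = pd.read_csv(path)

reloaded_columns = list(reloaded.columns)
print(reloaded_columns)
```

With `index=False` the file contains only the selected columns, so no `Unnamed: 0` column appears when it is read back.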