# Logistic regression, hyperparameters, weights

## Data
As usual, we start by loading the data and taking a look at it.

In [2]:
import pandas as pd


In [3]:
df = pd.read_csv('news_lenta_cropped.csv')

In [4]:
df.head()

Unnamed: 0,title,topic
0,Грудь Бритни Спирс вновь выскочила из лифчика ...,Культура
1,Попытка вернуть укатившийся мяч у Кремлевской ...,Силовые структуры
2,Хабенский стал врагом Украины,Культура
3,В Туркмении запретили продажу алкоголя,Бывший СССР
4,В Великобритании нашли утерянный сценарий Стен...,Культура


In [5]:
len(df)

129930

### Task 1

How many distinct values does `topic` take? How many objects belong to each topic?

In [6]:
set(df.topic)

{'Бывший СССР', 'Культура', 'Силовые структуры', 'Ценности'}

In [9]:
for top in set(df.topic):
    print(len(df[df.topic == top]))

6832
18480
52018
52600
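The loop above prints bare counts, and the iteration order of a `set` is arbitrary, so the numbers cannot be matched to topics by eye. A minimal sketch of a labelled alternative based on `value_counts`, shown on a small made-up frame (the `toy` data is an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the rows are made up for illustration.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "topic": ["Культура", "Культура", "Бывший СССР", "Ценности", "Культура"],
})

# Number of distinct topics.
n_topics = toy["topic"].nunique()

# Objects per topic, labelled and sorted from most to least frequent.
counts = toy["topic"].value_counts()

print(n_topics)
print(counts)
```

On the real data the same two calls would be `df.topic.nunique()` and `df.topic.value_counts()`.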


## Baseline

Let's see how naive Bayes and `CountVectorizer` with default settings handle the classification.

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


In [31]:
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(df.title)

In [32]:
X_train, X_test, y_train, y_test = train_test_split(bow, df.topic)

In [33]:
nb = MultinomialNB()
clf = nb.fit(X_train, y_train)

In [34]:
from sklearn.metrics import classification_report

In [35]:
print(classification_report(y_test, clf.predict(X_test)))

                   precision    recall  f1-score   support

      Бывший СССР       0.92      0.97      0.94     12931
         Культура       0.96      0.94      0.95     13301
Силовые структуры       0.89      0.86      0.87      4635
         Ценности       0.92      0.78      0.84      1616

      avg / total       0.93      0.93      0.93     32483



In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
tfidf = TfidfVectorizer(min_df=10)
bow_tfidf = tfidf.fit_transform(df.title)

In [25]:
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(bow_tfidf, df.topic)
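Note that the two `train_test_split` calls in this notebook draw different random splits, so the naive Bayes and logistic regression reports are computed on different test sets, and both vectorizers were fitted on the full corpus before splitting. One possible way to make the comparison cleaner is to split the raw titles once with a fixed `random_state` and wrap each vectorizer and model in a pipeline. A minimal sketch on made-up toy texts (the data and the names `texts`, `labels`, `nb_pipe`, `lr_pipe` are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for df.title / df.topic.
texts = [
    "new film premiere", "film festival opens",
    "actor wins award", "concert tour announced",
    "court sentences official", "police detain suspect",
    "police open case", "court hearing postponed",
] * 5
labels = (["culture"] * 4 + ["law"] * 4) * 5

# One split of the raw texts, shared by both models.
# stratify keeps the class proportions equal in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

nb_pipe = make_pipeline(CountVectorizer(), MultinomialNB())
lr_pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=42))

scores = {}
for name, pipe in [("nb", nb_pipe), ("lr", lr_pipe)]:
    # The vectorizer vocabulary is learned from the training part only.
    pipe.fit(X_tr, y_tr)
    scores[name] = pipe.score(X_te, y_te)

print(scores)
```

With this layout both models are evaluated on exactly the same held-out titles, so differences in the reports reflect the models rather than the split.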

## LogisticRegression

In [18]:
from sklearn.linear_model import LogisticRegression


In [26]:
lr = LogisticRegression(random_state=42)
clf = lr.fit(X_train_tfidf, y_train_tfidf)

In [27]:
print(classification_report(y_test_tfidf, clf.predict(X_test_tfidf)))

                   precision    recall  f1-score   support

      Бывший СССР       0.94      0.96      0.95     12988
         Культура       0.91      0.97      0.94     13215
Силовые структуры       0.93      0.82      0.87      4577
         Ценности       0.97      0.67      0.79      1703

      avg / total       0.93      0.93      0.93     32483



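The result is about the same as the baseline: the average scores are still 0.93, and recall on `Ценности` is actually lower (0.67 vs. 0.78). Keep in mind that the two reports come from different random splits. But what if?..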

## Hyperparameter tuning

### Manually

In [39]:
lr = LogisticRegression(C=9, random_state=42, solver='sag')
clf = lr.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))



                   precision    recall  f1-score   support

      Бывший СССР       0.96      0.96      0.96     12931
         Культура       0.95      0.97      0.96     13301
Силовые структуры       0.92      0.89      0.90      4635
         Ценности       0.94      0.82      0.87      1616

      avg / total       0.95      0.95      0.95     32483



Woo-hoo, even better!

### In a loop
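The notebook leaves this section empty. A minimal sketch of what a manual loop over `C` could look like, assuming the goal is to compare a few candidate values on a held-out validation set. Synthetic data from `make_classification` stands in for the real bag-of-words matrix, and the candidate values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data; an assumption standing in for the news titles.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42, stratify=y)

scores = {}
for C in [0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C, max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    # Macro-averaged F1 treats every class equally, regardless of its size.
    scores[C] = f1_score(y_val, model.predict(X_val), average="macro")

best_C = max(scores, key=scores.get)
print(scores)
print(best_C)
```

Choosing `C` on the same test set that is later used for the final report gives an optimistic estimate, which is why the sketch uses a separate validation split.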

### Grid Search

The most advanced and out-of-the-box way to do this.

In [26]:
from sklearn.model_selection import GridSearchCV

In [27]:
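# 'liblinear' supports both l1 and l2 penalties; newer scikit-learn versions default to 'lbfgs', which rejects l1
clf = LogisticRegression(solver='liblinear')
grid_values = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.009, 0.01, 0.09, 1, 5, 10, 25]}
grid_clf_acc = GridSearchCV(clf, param_grid=grid_values)
grid_clf_acc.fit(X_train, y_train)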
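After fitting, a `GridSearchCV` object exposes the winning combination and its cross-validated score through `best_params_`, `best_score_`, and `best_estimator_`. A minimal sketch on synthetic data (the dataset and the reduced grid over `C` are assumptions chosen to keep the run fast):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary data; an assumption standing in for X_train, y_train.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation for every candidate value of C
)
grid.fit(X, y)

print(grid.best_params_)   # the best value of C found by the search
print(grid.best_score_)    # mean cross-validated accuracy for that value
best_model = grid.best_estimator_  # refitted on all of X, y with the best C
```

On the notebook's data the same attributes would be read from `grid_clf_acc`, and `grid_clf_acc.predict(X_test)` would use the refitted best model.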

In [57]:
help(LogisticRegression)

Help on class LogisticRegression in module sklearn.linear_model.logistic:

class LogisticRegression(sklearn.base.BaseEstimator, sklearn.linear_model.base.LinearClassifierMixin, sklearn.linear_model.base.SparseCoefMixin)
 |  Logistic Regression (aka logit, MaxEnt) classifier.
 |  
 |  In the multiclass case, the training algorithm uses the one-vs-rest (OvR)
 |  scheme if the 'multi_class' option is set to 'ovr', and uses the cross-
 |  entropy loss if the 'multi_class' option is set to 'multinomial'.
 |  (Currently the 'multinomial' option is supported only by the 'lbfgs',
 |  'sag' and 'newton-cg' solvers.)
 |  
 |  This class implements regularized logistic regression using the
 |  'liblinear' library, 'newton-cg', 'sag' and 'lbfgs' solvers. It can handle
 |  both dense and sparse input. Use C-ordered arrays or CSR matrices
 |  containing 64-bit floats for optimal performance; any other input format
 |  will be converted (and copied).
 |  
 |  The 'newton-cg', 'sag', and 'lbfgs' solve

## SGD

Now try doing the same with `SGDClassifier`.
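One possible way to approach this is a grid search again, keeping in mind that `SGDClassifier` controls regularization strength through `alpha` rather than `C`. A minimal sketch on synthetic data; the grid values and the dataset are arbitrary assumptions, not the notebook's settings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data; an assumption standing in for X_train, y_train.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_grid = {
    "loss": ["hinge", "modified_huber"],   # linear SVM vs. smoothed hinge loss
    "alpha": [1e-5, 1e-4, 1e-3],           # larger alpha means stronger regularization
}
sgd_grid = GridSearchCV(
    SGDClassifier(class_weight="balanced", random_state=42),
    param_grid=param_grid,
    cv=3,
)
sgd_grid.fit(X, y)

print(sgd_grid.best_params_)
print(sgd_grid.best_score_)
```

Fixing `random_state` matters here because SGD shuffles the training data, so repeated runs would otherwise give slightly different scores.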

In [40]:
from sklearn.linear_model import SGDClassifier

In [30]:
help(SGDClassifier)

Help on class SGDClassifier in module sklearn.linear_model.stochastic_gradient:

class SGDClassifier(BaseSGDClassifier)
 |  Linear classifiers (SVM, logistic regression, a.o.) with SGD training.
 |  
 |  This estimator implements regularized linear models with stochastic
 |  gradient descent (SGD) learning: the gradient of the loss is estimated
 |  each sample at a time and the model is updated along the way with a
 |  decreasing strength schedule (aka learning rate). SGD allows minibatch
 |  (online/out-of-core) learning, see the partial_fit method.
 |  For best results using the default learning rate schedule, the data should
 |  have zero mean and unit variance.
 |  
 |  This implementation works with data represented as dense or sparse arrays
 |  of floating point values for the features. The model it fits can be
 |  controlled with the loss parameter; by default, it fits a linear support
 |  vector machine (SVM).
 |  
 |  The regularizer is a penalty added to the loss function tha

In [56]:
sgd = SGDClassifier(loss='modified_huber', alpha = 0.0001, class_weight='balanced')
clf = sgd.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))



                   precision    recall  f1-score   support

      Бывший СССР       0.97      0.95      0.96     12931
         Культура       0.95      0.96      0.96     13301
Силовые структуры       0.90      0.91      0.90      4635
         Ценности       0.89      0.86      0.88      1616

      avg / total       0.95      0.95      0.95     32483
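The `class_weight='balanced'` option used above reweights classes inversely to their frequency, following scikit-learn's `n_samples / (n_classes * count)` heuristic. A rough illustration using the per-topic counts printed in Task 1; this is an approximation, since the actual fit sees only the training split, and the topic names for each count are not shown in that output:

```python
import numpy as np

# Per-topic counts as printed in Task 1 (full dataset, order as printed).
counts = np.array([6832, 18480, 52018, 52600])

n_samples = counts.sum()
n_classes = len(counts)

# 'balanced' heuristic: rare classes get proportionally larger weights.
weights = n_samples / (n_classes * counts)

print(n_samples)
print(weights.round(2))
```

The smallest class ends up with a weight several times larger than the two biggest ones, which is consistent with the higher recall on the rare topic in the report above.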

