<a href="https://colab.research.google.com/github/krakowiakpawel9/ml_course/blob/master/sl/25_spam_detector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, use the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, use the command below:
```
!pip install --upgrade scikit-learn
```
This course is based on version `0.22.1`.

### Data preprocessing:
1. [Importing libraries](#0)
2. [Generating the data](#1)
3. [Creating a copy of the data](#2)
4. [Changing data types and initial exploration](#3)
5. [LabelEncoder](#4)
6. [OneHotEncoder](#5)
7. [Pandas *get_dummies()*](#6)
8. [Standardization - StandardScaler](#7)
9. [Preparing the data for the model](#8)



### <a name='0'></a> Importing libraries

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px
import sklearn
sklearn.__version__

'0.22.1'

In [2]:
df = pd.read_csv('spam-data.csv', encoding='latin-1')
df.head()

  kategoria                                              tekst
0       ham  Go until jurong point, crazy.. Available only ...
1       ham                      Ok lar... Joking wif u oni...
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...
3       ham  U dun say so early hor... U c already then say...
4       ham  Nah I don't think he goes to usf, he lives aro...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
kategoria    5572 non-null object
tekst        5572 non-null object
dtypes: object(2)
memory usage: 87.2+ KB


In [4]:
df['kategoria'].value_counts()

ham     4825
spam     747
Name: kategoria, dtype: int64
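
The counts above point to a clearly imbalanced dataset. A minimal sketch of turning them into class shares, assuming the two counts are rebuilt by hand as a Series (the names `counts` and `shares` are illustrative and not part of the original notebook). On the real DataFrame, `df['kategoria'].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
# `counts` and `shares` are hypothetical names used only in this sketch.
counts = pd.Series({'ham': 4825, 'spam': 747}, name='kategoria')

# Share of each class in the whole dataset.
shares = counts / counts.sum()
print(shares.round(3))
```

Roughly 13% of the messages are spam, so accuracy alone can look good even for a model that rarely predicts spam.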

In [5]:
df.describe().T

          count unique                     top  freq
kategoria  5572      2                     ham  4825
tekst      5572   5169  Sorry, I'll call later    30
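
The summary above reports 5572 messages but only 5169 unique texts, so 5572 - 5169 = 403 rows repeat an earlier message. A minimal sketch of the same check on a toy Series (the messages and the name `toy` are made up for illustration and do not come from `spam-data.csv`):

```python
import pandas as pd

# Toy stand-in for the 'tekst' column; these messages are invented.
toy = pd.Series(
    ['Ok...', 'Call me', 'Ok...', 'Ok...', 'Call me', 'See you'],
    name='tekst',
)

n_rows = len(toy)                     # all messages
n_unique = toy.nunique()              # distinct messages
n_repeats = int(toy.duplicated().sum())  # rows that repeat an earlier message

print(n_rows, n_unique, n_repeats)
```

When there are no missing values, `n_repeats` equals `n_rows - n_unique`, which is the same arithmetic as above.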


In [6]:
df.isnull().sum()

kategoria    0
tekst        0
dtype: int64

In [7]:
categories = df['kategoria'].value_counts()
categories = categories.reset_index()
categories

  index  kategoria
0   ham       4825
1  spam        747


In [8]:
px.pie(categories, 'index', 'kategoria', width=700, height=400, hole=0.4, title='Category distribution (spam, not spam)')

In [9]:
df['tekst'].value_counts()[:15]

Sorry, I'll call later                                                                                                                                                                 30
I cant pick the phone right now. Pls send a message                                                                                                                                    12
Ok...                                                                                                                                                                                  10
Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed £1000 cash or £5000 prize!                                     4
Say this slowly.? GOD,I LOVE YOU &amp; I NEED YOU,CLEAN MY HEART WITH YOUR BLOOD.Send this to Ten special people &amp; u c miracle tomorrow, do it,pls,pls do it...                     4
Your opinion about me? 1. Over 2. Jada 3. Kusruthi 4. Lovable 5. Silen

In [10]:
spam_sms = df[df['kategoria'] == 'spam']['tekst']
ham_sms = df[df['kategoria'] == 'ham']['tekst']
spam_sms[:5]

2     Free entry in 2 a wkly comp to win FA Cup fina...
5     FreeMsg Hey there darling it's been 3 week's n...
8     WINNER!! As a valued network customer you have...
9     Had your mobile 11 months or more? U R entitle...
11    SIX chances to win CASH! From 100 to 20,000 po...
Name: tekst, dtype: object

In [11]:
ham_sms[:5]

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
6    Even my brother is not like to speak with me. ...
Name: tekst, dtype: object

In [12]:
df.head()

  kategoria                                              tekst
0       ham  Go until jurong point, crazy.. Available only ...
1       ham                      Ok lar... Joking wif u oni...
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...
3       ham  U dun say so early hor... U c already then say...
4       ham  Nah I don't think he goes to usf, he lives aro...


In [16]:
df['dlugosc'] = df['tekst'].str.split().apply(len)
df.head()

  kategoria                                              tekst  dlugosc
0       ham  Go until jurong point, crazy.. Available only ...       20
1       ham                      Ok lar... Joking wif u oni...        6
2      spam  Free entry in 2 a wkly comp to win FA Cup fina...       28
3       ham  U dun say so early hor... U c already then say...       11
4       ham  Nah I don't think he goes to usf, he lives aro...       13
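
The `dlugosc` column holds the number of whitespace-separated tokens in each message, not the number of characters. A minimal sketch on a toy frame, assuming the first message is the one shown in row 1 above and the other two are made up for illustration (the name `toy` is hypothetical). It uses `.str.len()` in place of `.apply(len)`, which should give the same counts for string data without missing values.

```python
import pandas as pd

# Toy frame: the first message is copied from row 1 above,
# the remaining two are invented for this sketch.
toy = pd.DataFrame({'tekst': ['Ok lar... Joking wif u oni...', 'Free entry', '']})

# Split on whitespace, then count the tokens in each resulting list.
toy['dlugosc'] = toy['tekst'].str.split().str.len()
print(toy)
```

Punctuation stays attached to its token, so `lar...` counts as a single word, and an empty message gets a length of 0.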


In [29]:
px.histogram(df, x='dlugosc', facet_col='kategoria', width=800, height=400, nbins=100, range_x=[0, 50])