## Loading the Dataset

<span style="font-size:1.1em;">We mount Google Drive in Colab. The dataset used here is stored on Google Drive.</span>

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
PREPROCESSED_DATASET_WITH_STEMMER = "gdrive/My Drive/mbti/preprocessed_dataset_with_stemming.csv"
PREPROCESSED_DATASET_WITHOUT_STEMMER = "gdrive/My Drive/mbti/preprocessed_dataset_no_stemming.csv"
PREPROCESSED_DATASET_ZEMBEREK = "gdrive/My Drive/mbti/preprocessed_dataset_zemberek.csv"
TRIMMED_DATASET = "gdrive/My Drive/mbti/trimmed_dataset.csv"
RAW_DATASET = "gdrive/My Drive/mbti/all_users_v2.csv"

<span style="font-size:1.1em;">Pick whichever of the paths above corresponds to the dataset you want to work with and pass it as the argument.</span>

In [0]:
import pandas as pd 
df = pd.read_csv(PREPROCESSED_DATASET_ZEMBEREK, sep = ';', header = 0)

In [4]:
df

Unnamed: 0,user,entry,type,typeClass,I/E,S/N,T/F,J/P
0,19991991,ekşi itiraf dön dolaş gel kendi çat problem ge...,ENTJ,analysts,E,N,T,J
1,19991991,selda bağcan ses dinle dinleyebil dinleyebilme...,ENTJ,analysts,E,N,T,J
2,19991991,eski sevgili mutlu ol olma iste isteyen insan ...,ENTJ,analysts,E,N,T,J
3,19991991,veda et ederken not bırak bırakmak fark farklı...,ENTJ,analysts,E,N,T,J
4,19991991,ingiliz aksa ara bayıl bayıldık konuş konuşan ...,ENTJ,analysts,E,N,T,J
...,...,...,...,...,...,...,...,...
524797,zaimoglu,zlatan ıbrahimovic türkiye katil,ESFJ,sentinels,E,S,F,J
524798,zaimoglu,tarih tarihteki büyük yalan yalancı şike opera...,ESFJ,sentinels,E,S,F,J
524799,zaimoglu,akp chp koalisyon hayal koalisyon,ESFJ,sentinels,E,S,F,J
524800,zaimoglu,trabzon trabzonlu insan hamsi de diyen fenerba...,ESFJ,sentinels,E,S,F,J


## Feature Extraction

<span style="font-size:1.1em;">The vectorizer object used to extract the feature vectors is created with the parameters given below.</span>

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [0]:
count_vectorizer = CountVectorizer()

In [0]:
import numpy as np

# Cast every entry to a string; without this, fitting fails with
# "ValueError: np.nan is an invalid document". Missing entries become the text "nan".
df['entry'] = df['entry'].apply(lambda x: np.str_(x))
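The cast above keeps rows with missing text but stores them as the literal string "nan". The following is a minimal sketch of what that looks like next to an alternative that uses an empty document instead. The toy DataFrame is hypothetical, and whether an empty string is preferable for this dataset is an assumption, not something the notebook tests.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({"entry": ["merhaba dünya", np.nan, "selam"]})

# Same cast as in the notebook: the missing value turns into the text "nan".
as_text = toy["entry"].apply(lambda x: np.str_(x))

# Alternative: replace missing values with an empty document.
as_empty = toy["entry"].fillna("")

print(as_text.tolist())
print(as_empty.tolist())
```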

## Building the Model

<span style="font-size:1.1em;">The dataset is split into a training set and a test set: 20% of the data goes to the test set and 80% to the training set. The random_state parameter ensures that repeating the split produces the same train and test sets as before.</span>

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df['entry'], df['typeClass'], random_state = 42, test_size = 0.20)
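The effect of `random_state` can be checked on a small example. This is a sketch on hypothetical toy lists, not on the notebook's data: two calls with the same seed return identical splits.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten documents and their labels.
docs = [f"doc {i}" for i in range(10)]
labels = ["a", "b"] * 5

# Two splits with the same random_state select identical rows.
first = train_test_split(docs, labels, random_state=42, test_size=0.20)
second = train_test_split(docs, labels, random_state=42, test_size=0.20)

X_train, X_test, y_train, y_test = first
print(len(X_train), len(X_test))
print(first == second)
```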

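Feature vectors are extracted from the train and test sets. Note that `TfidfTransformer` is created with `use_idf=False`, so the resulting vectors are normalized term frequencies (TF) without IDF weighting.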


In [0]:
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
X_train_tf = tf_transformer.transform(X_train_count)
X_test_tf = tf_transformer.transform(X_test_count)
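To make the role of `use_idf=False` concrete, here is a small sketch on a hypothetical two-document corpus. It compares the transformer output with manually L2-normalized counts, and with the default `use_idf=True` output.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus.
corpus = ["kedi köpek kedi", "köpek kuş"]

counts = CountVectorizer().fit_transform(corpus)

# use_idf=False: each row is the raw counts scaled to unit L2 norm.
tf = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()

# use_idf=True (the default): terms are additionally reweighted by IDF.
tfidf = TfidfTransformer(use_idf=True).fit_transform(counts).toarray()

dense = counts.toarray()
manual = dense / np.linalg.norm(dense, axis=1, keepdims=True)

print(np.allclose(tf, manual))   # TF output equals normalized counts
print(np.allclose(tf, tfidf))    # IDF weighting changes the vectors
```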


A Multinomial Naive Bayes model is created. It is used to predict which of the classes in the dataset's "typeClass" column ("analysts", "diplomats", "sentinels", "explorers") each entry belongs to.

In [0]:
from sklearn.naive_bayes import MultinomialNB

In [0]:
clf = MultinomialNB().fit(X_train_tf, y_train)
test_typeClass = y_test.values

predictions = clf.predict(X_test_tf)

In [13]:
predictions

array(['analysts', 'analysts', 'analysts', ..., 'analysts', 'diplomats',
       'analysts'], dtype='<U9')

<span style="font-size:1.1em">To hold the statistics about the predictions, a variable named</span> ```prediction_results```<span style="font-size:1.1em"> is created.</span>

<span style="font-size:1.1em">The structure of this variable is shown below.</span>

```json
{
    "predicted": {
        "analysts":  { "actual": {"analysts": 0, "diplomats": 0, "explorers": 0, "sentinels": 0} },
        "diplomats": { "actual": {"analysts": 0, "diplomats": 0, "explorers": 0, "sentinels": 0} },
        "explorers": { "actual": {"analysts": 0, "diplomats": 0, "explorers": 0, "sentinels": 0} },
        "sentinels": { "actual": {"analysts": 0, "diplomats": 0, "explorers": 0, "sentinels": 0} }
    }
}
```

* <span style="font-size:1.1em;">To access the data about the predictions made:</span>

    ```prediction_results['predicted']```


* <span style="font-size:1.1em;">For the cases where the prediction is analysts:</span>

    ```prediction_results['predicted']['analysts']```


* <span style="font-size:1.1em;">To access the actual values behind the analysts predictions:</span>

    ```prediction_results['predicted']['analysts']['actual']```


* <span style="font-size:1.1em;">The number of test rows that the model predicted as analysts and whose actual value is also analysts:</span>

    ```prediction_results['predicted']['analysts']['actual']['analysts']```

In [0]:
prediction_results = {'predicted': {}}  ## prediction_results['predicted']['analysts'] holds the cases where the prediction is 'analysts'

prediction_results['predicted']['analysts']  = {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}}
prediction_results['predicted']['diplomats'] = {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}}
prediction_results['predicted']['explorers'] = {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}}
prediction_results['predicted']['sentinels'] = {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}}

## prediction_results['predicted']['analysts']['actual']['diplomats'] counts rows predicted as analysts whose actual value is diplomats

<span style="font-size:1.1em">The counters held in </span>```prediction_results```<span style="font-size:1.1em"> are incremented.</span>

In [0]:
for i in range(len(predictions)):
  predicted_value = predictions[i]
  actual_value = test_typeClass[i]
  prediction_results['predicted'][predicted_value]['actual'][actual_value] += 1
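The nested counters built by this loop are a confusion matrix stored as a dictionary. As a cross-check, the sketch below fills the same structure for a few hypothetical predictions and compares it with `sklearn.metrics.confusion_matrix`. The toy `actual` and `predicted` lists are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

labels = ["analysts", "diplomats", "explorers", "sentinels"]

# Hypothetical toy predictions and ground truth.
actual = ["analysts", "diplomats", "analysts", "sentinels", "explorers"]
predicted = ["analysts", "analysts", "analysts", "diplomats", "analysts"]

# Nested-dict counting, as in the notebook.
counts = {p: {"actual": {a: 0 for a in labels}} for p in labels}
for p, a in zip(predicted, actual):
    counts[p]["actual"][a] += 1

# confusion_matrix puts actual classes on rows and predictions on columns,
# so matrix[i, j] corresponds to counts[labels[j]]["actual"][labels[i]].
matrix = confusion_matrix(actual, predicted, labels=labels)
print(matrix)
```

Note the orientation: the dictionary is keyed by prediction first, whereas the matrix is indexed by actual class first.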

<span style="font-size:1.1em">The </span>```dict``` <span style="font-size:1.1em">is converted to JSON format so that it is printed in a more readable form.</span>

In [16]:
import json

print(json.dumps(prediction_results, indent = 2))

{
  "predicted": {
    "analysts": {
      "actual": {
        "analysts": 26812,
        "diplomats": 18295,
        "explorers": 6890,
        "sentinels": 12274
      }
    },
    "diplomats": {
      "actual": {
        "analysts": 11247,
        "diplomats": 18883,
        "explorers": 4047,
        "sentinels": 6485
      }
    },
    "explorers": {
      "actual": {
        "analysts": 0,
        "diplomats": 0,
        "explorers": 3,
        "sentinels": 0
      }
    },
    "sentinels": {
      "actual": {
        "analysts": 3,
        "diplomats": 5,
        "explorers": 3,
        "sentinels": 14
      }
    }
  }
}


<span style="font-size:1.1em;">The relevant field is extracted from the</span> ```dict``` <span style="font-size:1.1em;">structure.</span>

In [0]:
results = prediction_results['predicted']

<span style="font-size:1.1em;">The accuracy is computed.</span>

In [18]:
accuracy = (results['analysts']['actual']['analysts'] + results['diplomats']['actual']['diplomats'] + results['explorers']['actual']['explorers'] + results['sentinels']['actual']['sentinels']) / len(predictions)
accuracy

0.4355141433484818
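Accuracy alone hides how the errors are distributed. In the printed counts, only 28 of the 104,961 test rows are predicted as explorers or sentinels. The sketch below recomputes the accuracy from those counts and adds a per-class recall. The recall calculation is an addition for illustration and is not part of the original notebook.

```python
# Counts copied from the printed prediction_results above:
# outer key = predicted class, inner key = actual class.
results = {
    "analysts":  {"analysts": 26812, "diplomats": 18295, "explorers": 6890, "sentinels": 12274},
    "diplomats": {"analysts": 11247, "diplomats": 18883, "explorers": 4047, "sentinels": 6485},
    "explorers": {"analysts": 0, "diplomats": 0, "explorers": 3, "sentinels": 0},
    "sentinels": {"analysts": 3, "diplomats": 5, "explorers": 3, "sentinels": 14},
}
classes = list(results)

total = sum(sum(row.values()) for row in results.values())
accuracy = sum(results[c][c] for c in classes) / total

# Recall per class: correct predictions / all test rows that truly belong to the class.
recall = {c: results[c][c] / sum(results[p][c] for p in classes) for c in classes}

print(total, accuracy)
for c in classes:
    print(c, round(recall[c], 4))
```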

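<span style="font-size:1.1em;">The **E/I** dimension is predicted. Because test_size is not passed in this and the following cells, train_test_split uses its default of a 25% test set.</span>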

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df['entry'], df['I/E'], random_state = 42)  ## Remaining dimensions: S/N, T/F, J/P


X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
X_train_tf = tf_transformer.transform(X_train_count)
X_test_tf = tf_transformer.transform(X_test_count)

clf = MultinomialNB().fit(X_train_tf, y_train)
test_typeClass = y_test.values

predictions = clf.predict(X_test_tf)

In [20]:
predictions

array(['E', 'I', 'E', ..., 'E', 'I', 'E'], dtype='<U1')

In [21]:
predicted = {}
predicted['I'] = {'actual': {'I': 0, 'E': 0}}
predicted['E'] = {'actual': {'I': 0, 'E': 0}}
predicted

{'E': {'actual': {'E': 0, 'I': 0}}, 'I': {'actual': {'E': 0, 'I': 0}}}

In [0]:
for i in range(len(predictions)):
  predicted[predictions[i]]['actual'][test_typeClass[i]] += 1


In [23]:
predicted

{'E': {'actual': {'E': 43000, 'I': 31874}},
 'I': {'actual': {'E': 23363, 'I': 32964}}}

In [0]:
accuracy = (predicted['E']['actual']['E'] + predicted['I']['actual']['I']) / len(predictions)

In [25]:
accuracy

0.5789894894093796
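The same split, vectorize, fit, and count steps are repeated for every dimension. One way to avoid the repetition is a small helper. The sketch below is an assumption about how that could look: `evaluate_dimension` is a hypothetical name, it uses a scikit-learn pipeline rather than the separate objects in the notebook, and the toy DataFrame only demonstrates the call shape.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate_dimension(frame, label_column, text_column="entry"):
    """Hypothetical helper: train TF + MultinomialNB for one label column
    and return (accuracy, table of predicted vs. actual counts)."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame[text_column], frame[label_column], random_state=42
    )
    model = make_pipeline(
        CountVectorizer(), TfidfTransformer(use_idf=False), MultinomialNB()
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    table = pd.crosstab(
        pd.Series(predictions, name="predicted"),
        pd.Series(y_test.to_numpy(), name="actual"),
    )
    return accuracy_score(y_test, predictions), table


# Hypothetical toy data, only to show the call shape.
toy = pd.DataFrame({
    "entry": ["parti eğlence kalabalık", "kitap sessiz ev"] * 10,
    "I/E": ["E", "I"] * 10,
})
accuracy, table = evaluate_dimension(toy, "I/E")
print(accuracy)
print(table)
```

With the real data, the call would be `evaluate_dimension(df, 'S/N')` and so on for each dimension column.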

<span style="font-size:1.1em">The **S/N** dimension is predicted.</span>

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df['entry'], df['S/N'], random_state = 42)  ## Remaining dimensions: T/F, J/P

X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
X_train_tf = tf_transformer.transform(X_train_count)
X_test_tf = tf_transformer.transform(X_test_count)


clf = MultinomialNB().fit(X_train_tf, y_train)
test_typeClass = y_test.values

predictions = clf.predict(X_test_tf)

In [27]:
predicted['N'] = {'actual': {'N': 0, 'S': 0}}
predicted['S'] = {'actual': {'N': 0, 'S': 0}}

predicted

{'E': {'actual': {'E': 43000, 'I': 31874}},
 'I': {'actual': {'E': 23363, 'I': 32964}},
 'N': {'actual': {'N': 0, 'S': 0}},
 'S': {'actual': {'N': 0, 'S': 0}}}

In [0]:
for i in range(len(predictions)):
  predicted[predictions[i]]['actual'][test_typeClass[i]] += 1

In [29]:
predicted

{'E': {'actual': {'E': 43000, 'I': 31874}},
 'I': {'actual': {'E': 23363, 'I': 32964}},
 'N': {'actual': {'N': 94092, 'S': 37078}},
 'S': {'actual': {'N': 10, 'S': 21}}}

In [0]:
accuracy = (predicted['N']['actual']['N'] + predicted['S']['actual']['S']) / len(predictions)

In [31]:
accuracy

0.7173192277497885
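The S/N accuracy is the highest of the four dimensions, but the counts show that the model predicts S only 31 times. A useful comparison is the score of a baseline that always answers N. The sketch below derives both numbers from the printed counts. The baseline comparison is an addition for illustration and is not part of the original notebook.

```python
# Counts copied from the S/N rows of the printed `predicted` dict.
predicted_N = {"N": 94092, "S": 37078}  # predicted N, split by actual label
predicted_S = {"N": 10, "S": 21}        # predicted S, split by actual label

total = sum(predicted_N.values()) + sum(predicted_S.values())
accuracy = (predicted_N["N"] + predicted_S["S"]) / total

# Baseline: always answer "N", the more frequent actual label.
actual_N = predicted_N["N"] + predicted_S["N"]
baseline = actual_N / total

print(f"accuracy={accuracy:.4f} baseline={baseline:.4f}")
```

If the two values are nearly equal, the classifier has learned little beyond the class imbalance for this dimension.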

<span style="font-size:1.1em">The **T/F** dimension is predicted.</span>

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df['entry'], df['T/F'], random_state = 42)  ## Remaining dimension: J/P

X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
X_train_tf = tf_transformer.transform(X_train_count)
X_test_tf = tf_transformer.transform(X_test_count)

clf = MultinomialNB().fit(X_train_tf, y_train)
test_typeClass = y_test.values

predictions = clf.predict(X_test_tf)

In [33]:
predicted['T'] = {'actual': {'T': 0, 'F': 0}}
predicted['F'] = {'actual': {'T': 0, 'F': 0}}

predicted

{'E': {'actual': {'E': 43000, 'I': 31874}},
 'F': {'actual': {'F': 0, 'T': 0}},
 'I': {'actual': {'E': 23363, 'I': 32964}},
 'N': {'actual': {'N': 94092, 'S': 37078}},
 'S': {'actual': {'N': 10, 'S': 21}},
 'T': {'actual': {'F': 0, 'T': 0}}}

In [0]:
for i in range(len(predictions)):
  predicted[predictions[i]]['actual'][test_typeClass[i]] += 1

In [35]:
predicted

{'E': {'actual': {'E': 43000, 'I': 31874}},
 'F': {'actual': {'F': 24909, 'T': 14794}},
 'I': {'actual': {'E': 23363, 'I': 32964}},
 'N': {'actual': {'N': 94092, 'S': 37078}},
 'S': {'actual': {'N': 10, 'S': 21}},
 'T': {'actual': {'F': 37588, 'T': 53910}}}

In [0]:
accuracy = (predicted['F']['actual']['F'] + predicted['T']['actual']['T']) / len(predictions)

In [37]:
accuracy

0.6007499942835801

<span style="font-size:1.1em">The **J/P** dimension is predicted.</span>

In [0]:
X_train, X_test, y_train, y_test = train_test_split(df['entry'], df['J/P'], random_state = 42) 

X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
X_train_tf = tf_transformer.transform(X_train_count)
X_test_tf = tf_transformer.transform(X_test_count)

clf = MultinomialNB().fit(X_train_tf, y_train)
test_typeClass = y_test.values

predictions = clf.predict(X_test_tf)

In [39]:
predicted['J'] = {'actual': {'J': 0, 'P': 0}}
predicted['P'] = {'actual': {'J': 0, 'P': 0}}

predicted

{'E': {'actual': {'E': 43000, 'I': 31874}},
 'F': {'actual': {'F': 24909, 'T': 14794}},
 'I': {'actual': {'E': 23363, 'I': 32964}},
 'J': {'actual': {'J': 0, 'P': 0}},
 'N': {'actual': {'N': 94092, 'S': 37078}},
 'P': {'actual': {'J': 0, 'P': 0}},
 'S': {'actual': {'N': 10, 'S': 21}},
 'T': {'actual': {'F': 37588, 'T': 53910}}}

In [0]:
for i in range(len(predictions)):
  predicted[predictions[i]]['actual'][test_typeClass[i]] += 1

In [41]:
predicted

{'E': {'actual': {'E': 43000, 'I': 31874}},
 'F': {'actual': {'F': 24909, 'T': 14794}},
 'I': {'actual': {'E': 23363, 'I': 32964}},
 'J': {'actual': {'J': 27457, 'P': 18951}},
 'N': {'actual': {'N': 94092, 'S': 37078}},
 'P': {'actual': {'J': 36862, 'P': 47931}},
 'S': {'actual': {'N': 10, 'S': 21}},
 'T': {'actual': {'F': 37588, 'T': 53910}}}

In [0]:
accuracy = (predicted['P']['actual']['P'] + predicted['J']['actual']['J']) / len(predictions)

In [43]:
accuracy

0.5745992789689103

In [44]:
prediction_results['predicted'].update(predicted) 

prediction_results

{'predicted': {'E': {'actual': {'E': 43000, 'I': 31874}},
  'F': {'actual': {'F': 24909, 'T': 14794}},
  'I': {'actual': {'E': 23363, 'I': 32964}},
  'J': {'actual': {'J': 27457, 'P': 18951}},
  'N': {'actual': {'N': 94092, 'S': 37078}},
  'P': {'actual': {'J': 36862, 'P': 47931}},
  'S': {'actual': {'N': 10, 'S': 21}},
  'T': {'actual': {'F': 37588, 'T': 53910}},
  'analysts': {'actual': {'analysts': 26812,
    'diplomats': 18295,
    'explorers': 6890,
    'sentinels': 12274}},
  'diplomats': {'actual': {'analysts': 11247,
    'diplomats': 18883,
    'explorers': 4047,
    'sentinels': 6485}},
  'explorers': {'actual': {'analysts': 0,
    'diplomats': 0,
    'explorers': 3,
    'sentinels': 0}},
  'sentinels': {'actual': {'analysts': 3,
    'diplomats': 5,
    'explorers': 3,
    'sentinels': 14}}}}