## Loading the Dataset

<span style="font-size:1.1em;">We mount Google Drive in Colab. The dataset to be used is stored on Google Drive.</span>

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
PREPROCESSED_DATASET_WITH_STEMMER = "gdrive/My Drive/mbti/preprocessed_dataset_with_stemming.csv"
PREPROCESSED_DATASET_WITHOUT_STEMMER = "gdrive/My Drive/mbti/preprocessed_dataset_no_stemming.csv"
PREPROCESSED_DATASET_ZEMBEREK = "gdrive/My Drive/mbti/preprocessed_dataset_zemberek.csv"
TRIMMED_DATASET = "gdrive/My Drive/mbti/trimmed_dataset.csv"
RAW_DATASET = "gdrive/My Drive/mbti/all_users_v2.csv"

<span style="font-size:1.1em;">Pick one of the paths above, depending on which dataset you want to work with, and pass it as the path argument when reading the file.</span>

In [0]:
import pandas as pd 
df = pd.read_csv(PREPROCESSED_DATASET_WITHOUT_STEMMER, sep = ';', header = 0)

In [0]:
df
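
As a quick sanity check of the loading step, the sketch below reads a small semicolon-separated sample from memory with the same `sep` and `header` arguments. The sample rows are invented for illustration; the column names are the ones the later cells access, and the real file may contain more columns.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the real file lives on Google Drive.
sample_csv = (
    "entry;typeClass;I/E;S/N;T/F;J/P\n"
    "bugun hava cok guzel;analysts;I;N;T;J\n"
    "yarin sinava calisacagim;sentinels;E;S;F;J\n"
)

sample_df = pd.read_csv(io.StringIO(sample_csv), sep=';', header=0)

# Columns that the later cells read from the DataFrame.
expected_columns = ['entry', 'typeClass', 'I/E', 'S/N', 'T/F', 'J/P']
missing = [c for c in expected_columns if c not in sample_df.columns]
print(sample_df.shape, missing)
```

Running the same `missing` check on the real `df` is a cheap way to catch a wrong separator, which would typically collapse everything into a single column.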

## Feature Extraction

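<span style="font-size:1.1em;">The vectorizer used to extract the TF-IDF feature vector is created below. Note that the model cells later call TfidfTransformer with use_idf=False, so the resulting features are normalized term frequencies without IDF weighting.</span>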

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [0]:
count_vectorizer = CountVectorizer()

In [0]:
import numpy as np

df['entry'] = df['entry'].apply(lambda x: np.str_(x))  # Cast every entry to str; NaN entries otherwise raise "ValueError: np.nan is an invalid document".
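
To make the feature extraction step concrete, here is a minimal sketch on an invented three-document corpus. It assumes the same combination the model cells use (`CountVectorizer` followed by `TfidfTransformer(use_idf=False)`); the variable names `toy_raw`, `toy_docs`, and `toy_vectorizer` are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented toy corpus with one missing entry.
toy_raw = ["kedi kopek kedi", float("nan"), "kopek kus"]

# Same cast as in the notebook: NaN becomes the literal string "nan".
toy_docs = [np.str_(x) for x in toy_raw]

toy_vectorizer = CountVectorizer()
counts = toy_vectorizer.fit_transform(toy_docs)  # raw term counts (sparse)

# use_idf=False skips IDF weighting; rows are L2-normalized by default.
tf = TfidfTransformer(use_idf=False).fit_transform(counts)

vocabulary = sorted(toy_vectorizer.vocabulary_)
row_norms = np.linalg.norm(tf.toarray(), axis=1)
print(vocabulary)
print(row_norms)
```

One side effect worth knowing: after the string cast, a missing entry is no longer an error but a document containing the single token `nan`, which ends up in the vocabulary.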

## Building the Model

<span style="font-size:1.1em;">k-fold cross-validation is used so that the model is tested on the entire dataset. The dataset is split into k parts: k-1 parts form the training set and the remaining part is the test set. The test part changes on every iteration, so each part is used for testing exactly once.</span>

In [0]:
from sklearn.model_selection import KFold 
kFold = KFold(n_splits = 5, shuffle = True, random_state = None)
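
The rotation of train and test parts can be seen on a small stand-in array. This is only a sketch: `toy_rows` and `toy_kfold` are illustrative names, and the fixed `random_state=0` is used here just to make the demo repeatable.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_rows = np.arange(10)  # stand-in for 10 dataset rows
toy_kfold = KFold(n_splits=5, shuffle=True, random_state=0)

seen_in_test = []
for fold, (train_idx, test_idx) in enumerate(toy_kfold.split(toy_rows), start=1):
    # With 10 rows and 5 folds: 8 rows for training, 2 for testing.
    print(f"fold {fold}: train={len(train_idx)} rows, test={test_idx.tolist()}")
    seen_in_test.extend(test_idx.tolist())

# Every row index lands in a test part exactly once across the folds.
print(sorted(seen_in_test) == list(range(10)))
```

With `shuffle = True` and `random_state = None`, as in the cell above, each call to `kFold.split` draws a new shuffle, so separate loops over `kFold.split(df)` will generally see different partitions. Passing an integer seed makes the folds reproducible.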

A dictionary is created to store the results.

In [0]:
results = {
    'predicted': {
        'I': {'actual': {'I': 0, 'E': 0}},
        'E': {'actual': {'I': 0, 'E': 0}},

        'S': {'actual': {'S': 0, 'N': 0}},
        'N': {'actual': {'S': 0, 'N': 0}},

        'T': {'actual': {'T': 0, 'F': 0}},
        'F': {'actual': {'T': 0, 'F': 0}},

        'J': {'actual': {'J': 0, 'P': 0}},
        'P': {'actual': {'J': 0, 'P': 0}},

        'analysts': {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}},
        'diplomats': {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}},
        'explorers': {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}},
        'sentinels': {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}}
    }

}

**typeClass** is predicted

In [0]:
from sklearn.naive_bayes import MultinomialNB

iteration = 1

for train_indices, test_indices in kFold.split(df):
  print("Started iteration: {}".format(iteration))
  # Rows selected for training in this fold
  train = df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['typeClass']

  # Held-out rows for this fold
  test = df.iloc[test_indices]
  X_test = test['entry']
  y_test = test['typeClass']

  # Learn the vocabulary on the training rows only, then reuse it for the test rows
  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  # Normalized term frequencies (use_idf=False disables IDF weighting)
  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1

  # Tally each (predicted, actual) pair in the results dictionary
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1


The **I/E** dimension is predicted

In [0]:
iteration = 1

for train_indices, test_indices in kFold.split(df):    
  print("Started iteration: {}".format(iteration))
  train = df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['I/E']

  test  = df.iloc[test_indices]
  X_test = test['entry']    
  y_test = test['I/E']

  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1
  
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1

The **S/N** dimension is predicted

In [0]:
iteration = 1

for train_indices, test_indices in kFold.split(df):    
  print("Started iteration: {}".format(iteration))
  train = df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['S/N']

  test  = df.iloc[test_indices]
  X_test = test['entry']    
  y_test = test['S/N']

  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1
  
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1


The **T/F** dimension is predicted

In [0]:
iteration = 1

for train_indices, test_indices in kFold.split(df):    
  print("Started iteration: {}".format(iteration))
  train = df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['T/F']

  test  = df.iloc[test_indices]
  X_test = test['entry']    
  y_test = test['T/F']

  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1
  
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1


The **J/P** dimension is predicted

In [0]:
iteration = 1

for train_indices, test_indices in kFold.split(df):    
  print("Started iteration: {}".format(iteration))
  train = df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['J/P']

  test  = df.iloc[test_indices]
  X_test = test['entry']    
  y_test = test['J/P']

  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1
  
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1
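
The fold loops in this section differ only in the label column, so they could be folded into one helper. The sketch below is one possible refactoring, not the notebook's own code: `cross_validate_label` is a hypothetical name, the toy data is invented, and the three steps are chained with scikit-learn's `make_pipeline`, which fits the vectorizer and transformer on the training rows only, as the loops do.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def cross_validate_label(frame, label_column, k_fold, tallies):
    """Hypothetical helper: run the fold loop for one label column and
    add each (predicted, actual) pair to the nested tallies dictionary."""
    for train_idx, test_idx in k_fold.split(frame):
        train, test = frame.iloc[train_idx], frame.iloc[test_idx]
        model = make_pipeline(
            CountVectorizer(),
            TfidfTransformer(use_idf=False),
            MultinomialNB(),
        )
        model.fit(train['entry'], train[label_column])
        predictions = model.predict(test['entry'])
        for predicted, actual in zip(predictions, test[label_column].values):
            tallies['predicted'][predicted]['actual'][actual] += 1
    return tallies


# Invented toy data, only to make the sketch runnable.
toy_df = pd.DataFrame({
    'entry': ['sessiz kitap okumak', 'kalabalik parti eglence',
              'evde yalniz kitap', 'arkadas parti dans',
              'sessiz evde film', 'kalabalik konser eglence'],
    'I/E': ['I', 'E', 'I', 'E', 'I', 'E'],
})
toy_tallies = {'predicted': {'I': {'actual': {'I': 0, 'E': 0}},
                             'E': {'actual': {'I': 0, 'E': 0}}}}
toy_kfold = KFold(n_splits=3, shuffle=True, random_state=0)
cross_validate_label(toy_df, 'I/E', toy_kfold, toy_tallies)

# Each row is tested once, so the tallies sum to the number of rows.
total = sum(n for row in toy_tallies['predicted'].values()
            for n in row['actual'].values())
print(total)
```

On the real data, the call would presumably look like `cross_validate_label(df, 'T/F', kFold, results)`, repeated for each label column.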


In [0]:
results
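
The nested dictionary is effectively a set of confusion matrices keyed as `predicted -> actual`. As a sketch of how it could be summarized, the hypothetical helper below computes accuracy for one group of labels; the counts in `example` are invented and only mirror the layout of `results`.

```python
def accuracy(tallies, labels):
    """Share of predictions whose predicted label equals the actual label."""
    correct = sum(tallies['predicted'][label]['actual'][label] for label in labels)
    total = sum(tallies['predicted'][p]['actual'][a] for p in labels for a in labels)
    return correct / total if total else float('nan')


# Invented counts in the same nested layout as `results`.
example = {
    'predicted': {
        'I': {'actual': {'I': 40, 'E': 10}},
        'E': {'actual': {'I': 5, 'E': 45}},
    }
}
print(accuracy(example, ['I', 'E']))  # (40 + 45) / 100 = 0.85
```

Applied to the notebook's dictionary, the call would be along the lines of `accuracy(results, ['I', 'E'])` or `accuracy(results, ['analysts', 'diplomats', 'explorers', 'sentinels'])`.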