## Loading the Dataset

<span style="font-size:1.1em;">Google Drive is mounted in Colab. The dataset to be used is stored on Google Drive.</span>

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

In [0]:
PREPROCESSED_DATASET_WITH_STEMMER = "gdrive/My Drive/mbti/preprocessed_dataset_with_stemming.csv"
PREPROCESSED_DATASET_WITHOUT_STEMMER = "gdrive/My Drive/mbti/preprocessed_dataset_no_stemming.csv"
PREPROCESSED_DATASET_ZEMBEREK = "gdrive/My Drive/mbti/preprocessed_dataset_zemberek.csv"
TRIMMED_DATASET = "gdrive/My Drive/mbti/trimmed_dataset.csv"
RAW_DATASET = "gdrive/My Drive/mbti/all_users_v2.csv"

<span style="font-size:1.1em;">Choose whichever of the paths above corresponds to the dataset you want to work with and pass it as the parameter below.</span>

In [0]:
import pandas as pd 
df = pd.read_csv(PREPROCESSED_DATASET_WITH_STEMMER, sep = ';', header = 0)

In [0]:
df

## Feature Extraction

<span style="font-size:1.1em;">The vectorizer used to extract the TF-IDF feature vectors is created below; here CountVectorizer is instantiated with its default parameters.</span>

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [0]:
count_vectorizer = CountVectorizer()
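
As a minimal sketch of what these two classes produce, the example below runs them on two made-up documents (the `docs` list is an assumption for illustration, not part of the dataset). It uses `use_idf=False`, as the training loops later in this notebook do, so the output is an L2-normalized term-frequency matrix rather than full TF-IDF.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy documents, used only to illustrate the transformation.
docs = ["red apple red", "green apple"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix of raw term counts

# Vocabulary terms ordered by their column index in the count matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# With use_idf=False only term frequencies are kept; rows are L2-normalized by default.
tf = TfidfTransformer(use_idf=False).fit_transform(counts)
row_norms = np.linalg.norm(tf.toarray(), axis=1)

print(vocab)
print(counts.toarray())
print(tf.toarray())
```

Each row of `tf` has unit length, so long and short entries contribute on the same scale.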

In [0]:
import numpy as np

df['entry'] = df['entry'].apply(lambda x: np.str_(x)) # Cast every entry to a string; otherwise the vectorizer raises "ValueError: np.nan is an invalid document".
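
A small sketch of what this cast does to a missing value, on a made-up two-row dataframe (`toy` is an assumption for illustration). Note that the cast turns a missing entry into the literal token `'nan'`; replacing missing entries with an empty string via `fillna("")` is one possible alternative, shown here only for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing entry.
toy = pd.DataFrame({"entry": ["merhaba dunya", np.nan]})

# Same cast as in the notebook: the missing value becomes the string 'nan'.
as_text = [str(v) for v in toy["entry"].apply(lambda x: np.str_(x))]

# Alternative (an assumption, not the notebook's approach): use an empty document instead.
filled = list(toy["entry"].fillna(""))

print(as_text)
print(filled)
```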

## Building the Model

<span style="font-size:1.1em;">k-fold cross-validation is used so that the model is tested on the entire dataset. The dataset is split into k parts: k-1 parts form the training set and the remaining part forms the test set. The held-out part changes on every iteration, so each part serves as the test set exactly once.</span>

In [0]:
from sklearn.model_selection import KFold 
kFold = KFold(n_splits = 5, shuffle = True, random_state = None)
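
To make the splitting concrete, here is a minimal sketch on ten dummy samples (the `samples` array and the fixed `random_state=0` are assumptions chosen for reproducibility; the notebook itself uses `random_state=None`). It records the train/test sizes of each fold and checks that every sample lands in a test fold once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples, so each of the 5 folds holds out exactly 2 of them.
samples = np.arange(10)
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
seen_in_test = []
for train_idx, test_idx in k_fold.split(samples):
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_in_test.extend(test_idx.tolist())

print(fold_sizes)            # (train size, test size) for every fold
print(sorted(seen_in_test))  # every index appears in exactly one test fold
```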

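A dictionary is created to store the results: `results['predicted'][p]['actual'][a]` counts the test entries predicted as `p` whose true label is `a`.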

In [0]:
results = {
    'predicted': {
        'I': {'actual': {'I': 0, 'E': 0}},
        'E': {'actual': {'I': 0, 'E': 0}},

        'S': {'actual': {'S': 0, 'N': 0}},
        'N': {'actual': {'S': 0, 'N': 0}},

        'T': {'actual': {'T': 0, 'F': 0}},
        'F': {'actual': {'T': 0, 'F': 0}},

        'J': {'actual': {'J': 0, 'P': 0}},
        'P': {'actual': {'J': 0, 'P': 0}},

        'analysts': {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}},
        'diplomats': {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}},
        'explorers': {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}},
        'sentinels': {'actual': {'analysts': 0, 'diplomats': 0, 'explorers': 0, 'sentinels': 0}}
    }

}

**typeClass** is predicted.

In [0]:
min_entry = df.groupby('typeClass', as_index = False).count().min().entry

A new dataframe is built with an equal number of entries per class.

In [0]:
analysts_df = df[df['typeClass'] == 'analysts']
analysts_df = analysts_df.iloc[0 : min_entry]

analysts_df.shape[0]


In [0]:
explorers_df = df[df['typeClass'] == 'explorers']
explorers_df = explorers_df.iloc[0 : min_entry]

explorers_df.shape[0]

In [0]:
sentinels_df = df[df['typeClass'] == 'sentinels']
sentinels_df = sentinels_df.iloc[0 : min_entry]

sentinels_df.shape[0]

In [0]:
diplomats_df = df[df['typeClass'] == 'diplomats']

diplomats_df = diplomats_df.iloc[0 : min_entry]

diplomats_df.shape[0]

The per-class dataframes are concatenated into a new dataframe.

In [0]:
equal_entries_df = pd.concat([diplomats_df, sentinels_df, explorers_df, analysts_df]).reset_index(drop=True)
equal_entries_df
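
The per-class truncation and concatenation above can also be sketched in one step with `groupby(...).head(n)`. The example uses a made-up six-row dataframe (`toy` is an assumption for illustration). Unlike the `pd.concat` call above, this variant keeps rows in their original order rather than grouping them class by class.

```python
import pandas as pd

# Hypothetical toy frame with unbalanced classes (3, 2 and 1 rows).
toy = pd.DataFrame({
    "entry": ["a", "b", "c", "d", "e", "f"],
    "typeClass": ["analysts", "analysts", "analysts",
                  "diplomats", "diplomats", "explorers"],
})

# Size of the smallest class.
min_entry = int(toy["typeClass"].value_counts().min())

# Keep the first min_entry rows of every class.
balanced = toy.groupby("typeClass").head(min_entry).reset_index(drop=True)

print(balanced)
```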

The resulting dataframe is shuffled.

In [0]:
equal_entries_df = equal_entries_df.sample(frac=1).reset_index(drop=True)

In [0]:
equal_entries_df

In [0]:
from sklearn.naive_bayes import MultinomialNB

iteration = 1

for train_indices, test_indices in kFold.split(equal_entries_df):    
  print("Started iteration: {}".format(iteration))
  train = equal_entries_df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['typeClass']

  test  = equal_entries_df.iloc[test_indices]
  X_test = test['entry']    
  y_test = test['typeClass']

  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1
  
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1
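
Since the same loop is repeated for every label column, it could be wrapped in a helper. The sketch below is one way to do that, not the notebook's own code: the function name `cross_validated_counts`, the use of `make_pipeline`, and the toy data are all assumptions for illustration. It returns a `Counter` keyed by `(predicted, actual)` pairs instead of updating the nested `results` dictionary.

```python
from collections import Counter

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def cross_validated_counts(frame, label_col, k_fold):
    """Hypothetical helper: count (predicted, actual) pairs over all folds."""
    counts = Counter()
    for train_idx, test_idx in k_fold.split(frame):
        train, test = frame.iloc[train_idx], frame.iloc[test_idx]
        # Same three steps as the loop above, chained into one estimator.
        model = make_pipeline(
            CountVectorizer(),
            TfidfTransformer(use_idf=False),
            MultinomialNB(),
        )
        model.fit(train["entry"], train[label_col])
        predicted = model.predict(test["entry"])
        counts.update(zip(predicted, test[label_col].values))
    return counts


# Made-up data, only to show the call; not taken from the real dataset.
toy = pd.DataFrame({
    "entry": ["quiet book home", "party friends loud", "book tea quiet",
              "loud music party", "home quiet tea", "friends party music"],
    "I/E": ["I", "E", "I", "E", "I", "E"],
})
pair_counts = cross_validated_counts(
    toy, "I/E", KFold(n_splits=3, shuffle=True, random_state=0)
)
print(sum(pair_counts.values()))  # one prediction per row
```

Fitting the pipeline inside the loop keeps the vocabulary restricted to the training fold, which matches the `fit_transform` / `transform` split used above.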


The **I/E** dimension is predicted.

A new dataframe is built with an equal number of entries per class.

In [0]:
min_entry = df.groupby('I/E', as_index = False).count().min().entry

I_df = df[df['I/E'] == 'I']
I_df = I_df.iloc[0 : min_entry]

I_df.shape[0]


In [0]:
E_df = df[df['I/E'] == 'E']

E_df = E_df.iloc[0 : min_entry]

E_df.shape[0]

The per-class dataframes are concatenated into a new dataframe.

In [0]:
equal_entries_df = pd.concat([I_df, E_df]).reset_index(drop=True)
equal_entries_df

In [0]:
iteration = 1

for train_indices, test_indices in kFold.split(equal_entries_df):    
  print("Started iteration: {}".format(iteration))
  train = equal_entries_df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['I/E']

  test  = equal_entries_df.iloc[test_indices]
  X_test = test['entry']    
  y_test = test['I/E']

  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1
  
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1

The **S/N** dimension is predicted.

A new dataframe is built with an equal number of entries per class.

In [0]:
min_entry = df.groupby('S/N', as_index = False).count().min().entry

S_df = df[df['S/N'] == 'S']
S_df = S_df.iloc[0 : min_entry]

S_df.shape[0]


In [0]:
N_df = df[df['S/N'] == 'N']

N_df = N_df.iloc[0 : min_entry]

N_df.shape[0]

The per-class dataframes are concatenated into a new dataframe.

In [0]:
equal_entries_df = pd.concat([S_df, N_df]).reset_index(drop=True)
equal_entries_df

In [0]:
iteration = 1

for train_indices, test_indices in kFold.split(equal_entries_df):    
  print("Started iteration: {}".format(iteration))
  train = equal_entries_df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['S/N']

  test  = equal_entries_df.iloc[test_indices]
  X_test = test['entry']    
  y_test = test['S/N']

  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1
  
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1


The **T/F** dimension is predicted.

A new dataframe is built with an equal number of entries per class.

In [0]:
min_entry = df.groupby('T/F', as_index = False).count().min().entry

T_df = df[df['T/F'] == 'T']
T_df = T_df.iloc[0 : min_entry]

T_df.shape[0]


In [0]:
F_df = df[df['T/F'] == 'F']

F_df = F_df.iloc[0 : min_entry]

F_df.shape[0]

The per-class dataframes are concatenated into a new dataframe.

In [0]:
equal_entries_df = pd.concat([T_df, F_df]).reset_index(drop=True)
equal_entries_df

In [0]:
iteration = 1

for train_indices, test_indices in kFold.split(equal_entries_df):    
  print("Started iteration: {}".format(iteration))
  train = equal_entries_df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['T/F']

  test  = equal_entries_df.iloc[test_indices]
  X_test = test['entry']    
  y_test = test['T/F']

  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1
  
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1


The **J/P** dimension is predicted.

A new dataframe is built with an equal number of entries per class.

In [0]:
min_entry = df.groupby('J/P', as_index = False).count().min().entry

J_df = df[df['J/P'] == 'J']
J_df = J_df.iloc[0 : min_entry]

J_df.shape[0]


In [0]:
P_df = df[df['J/P'] == 'P']

P_df = P_df.iloc[0 : min_entry]

P_df.shape[0]

The per-class dataframes are concatenated into a new dataframe.

In [0]:
equal_entries_df = pd.concat([J_df, P_df]).reset_index(drop=True)
equal_entries_df

In [0]:
iteration = 1

for train_indices, test_indices in kFold.split(equal_entries_df):    
  print("Started iteration: {}".format(iteration))
  train = equal_entries_df.iloc[train_indices]
  X_train = train['entry']
  y_train = train['J/P']

  test  = equal_entries_df.iloc[test_indices]
  X_test = test['entry']    
  y_test = test['J/P']

  X_train_count = count_vectorizer.fit_transform(X_train)
  X_test_count = count_vectorizer.transform(X_test)

  tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_count)
  X_train_tf = tf_transformer.transform(X_train_count)
  X_test_tf = tf_transformer.transform(X_test_count)

  clf = MultinomialNB().fit(X_train_tf, y_train)
  y_test = y_test.values
  y_predicted = clf.predict(X_test_tf)

  print("Finished iteration: {}".format(iteration))
  iteration += 1
  
  for i in range(len(y_predicted)):
    actual = y_test[i]
    predicted = y_predicted[i]
    results['predicted'][predicted]['actual'][actual] += 1


In [0]:
results
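
The nested counts can be reduced to a single accuracy figure per task. The sketch below assumes the same `results['predicted'][p]['actual'][a]` layout as above; the `accuracy` helper is a hypothetical name and the numbers in `toy_results` are invented for illustration, not outputs of this notebook.

```python
def accuracy(results, labels):
    """Share of predictions over `labels` where predicted == actual."""
    correct = sum(results['predicted'][label]['actual'][label] for label in labels)
    total = sum(
        results['predicted'][p]['actual'][a] for p in labels for a in labels
    )
    return correct / total if total else float('nan')


# Invented counts with the same layout as the results dictionary.
toy_results = {
    'predicted': {
        'I': {'actual': {'I': 30, 'E': 10}},
        'E': {'actual': {'I': 20, 'E': 40}},
    }
}

print(accuracy(toy_results, ['I', 'E']))  # (30 + 40) / 100 = 0.7
```

With the real dictionary, the same call would be made once per task, for example with `['I', 'E']` or `['analysts', 'diplomats', 'explorers', 'sentinels']`.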