# Data Preprocessing

The goal of this lab is to introduce you to data preprocessing techniques in order to make your data suitable for applying a learning algorithm.

## 1. Handling Missing Values

A common (and very unfortunate) data property is the ocurrence of missing and erroneous values in multiple features in our dataset.
Download the dataset and corresponding information from the <a href="http://www.cs.uni-potsdam.de/ml/teaching/ss15/ida/uebung02/abalone.csv">course website</a>.

To determine the age of a abalone snail you have to kill the snail and count the annual
rings. You are told to estimate the age of a snail on the basis of the following attributes:
1. type: male (0), female (1) and infant (2)
2. length in mm
3. width in mm
4. height in mm
5. total weight in grams
6. weight of the meat in grams
7. drained weight in grams
8. weight of the shell in grams
9. number of annual rings (number of rings +1, 5 yields age)

However, these data is incomplete. Missing values are marked with −1.

In [8]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
import matplotlib.pyplot as plt

# load data 
df1 = pd.read_csv("abalone.csv")
df1.columns=['type','length','width','height','total_weight','meat_weight','drained_weight','shell_weight','num_rings']
df1.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,-1
1,1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,2,-1.0,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,2,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


### Exercise 1.1

Compute the mean of all positive numbers of each numeric column and the counts of each category.

In [19]:
print(df1.groupby(['type'])[['type']].count())

      type
type      
-1      87
 0    1500
 1    1279
 2    1310


In [9]:
print("Mean of all positive numbers of each numeric column\n")
print(df1[df1.iloc[:,1:8] > 0].mean()[1:8],"\n")


print(df1[df1.iloc[:,1:8] > 0].count()[1:8],"\n")
print("The counts of each category\n")
print(df1.groupby(['type'])[['type']].count())

Mean of all positive numbers of each numeric column



ValueError: Boolean array expected for the condition, not float64

### Exercise 1.2

Compute the median of all positive numbers of each numeric column.

In [11]:
print("Median of all positive numbers of each numeric column\n")
print(df1[df1.iloc[:,1:9] >= 0].median()[1:9],"\n")

Median of all positive numbers of each numeric column



ValueError: Boolean array expected for the condition, not float64

### Exercise 1.3

Handle the missing values in a way that you find suitable. Argue your choices.

In [6]:
from sklearn.preprocessing import Imputer

In [72]:
imp = Imputer(missing_values = -1,strategy = "median", axis=0)



In [7]:
imp.fit(df1)

NameError: name 'imp' is not defined

In [77]:
df_cleaned = pd.DataFrame(data = imp.transform(df1), columns=df.columns)

In [78]:
df_cleaned

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0.0,0.350,0.265,0.090,0.22550,0.0995,0.0485,0.0700,9.0
1,1.0,0.530,0.420,0.135,0.67700,0.2565,0.1415,0.2100,9.0
2,0.0,0.440,0.365,0.125,0.51600,0.2155,0.1140,0.1550,10.0
3,2.0,0.545,0.255,0.080,0.20500,0.0895,0.0395,0.0550,7.0
4,2.0,0.425,0.300,0.095,0.35150,0.1410,0.0775,0.1200,8.0
5,1.0,0.530,0.415,0.150,0.77750,0.2370,0.1415,0.3300,20.0
6,1.0,0.545,0.425,0.125,0.76800,0.2940,0.1495,0.2600,16.0
7,0.0,0.475,0.370,0.125,0.50950,0.2165,0.1125,0.1650,9.0
8,1.0,0.550,0.440,0.150,0.89450,0.3145,0.1510,0.3200,19.0
9,1.0,0.545,0.380,0.140,0.60650,0.1940,0.1475,0.2335,14.0


In [102]:
#transorm to numpy array
#data
X = df_cleaned.loc[df_cleaned["type"] >= 0,"length":].values
#target
y = df_cleaned.loc[df1["type"] >= 0,"type"].values
X_missing = df_cleaned.loc[df1["type"] == -1,"length":].values

In [80]:
print("about",np.round(X_missing.shape[0] / df1.shape[0] * 100, decimals = 2),"% the data is missing")

about 0.0 % the data is missing


I dropped it because first it only missing 2% and other classifier cant predict significantlty the missing gender

In [81]:
df1.loc[df1["type"] < 0,"type"] = np.nan

In [82]:
df1 = df1.dropna()

In [83]:
#prepare the data for modeling
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.1, random_state=42, stratify=y )

In [84]:
#KneighborClassifiert
knn = KNeighborsClassifier(n_neighbors=14)
knn.fit(X_train,y_train)
#print(knn.score(X_test, y_test))
y_predict = knn.predict(X_test)
print("Confusion Matrix")
print(classification_report(y_test,y_predict))

Confusion Matrix
              precision    recall  f1-score   support

         0.0       0.48      0.58      0.53       150
         1.0       0.48      0.37      0.42       137
         2.0       0.75      0.75      0.75       131

   micro avg       0.56      0.56      0.56       418
   macro avg       0.57      0.57      0.56       418
weighted avg       0.56      0.56      0.56       418



In [85]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_predict = logreg.predict(X_test)
print(classification_report(y_test,y_predict))

              precision    recall  f1-score   support

         0.0       0.49      0.50      0.49       150
         1.0       0.55      0.36      0.44       137
         2.0       0.64      0.85      0.73       131

   micro avg       0.56      0.56      0.56       418
   macro avg       0.56      0.57      0.55       418
weighted avg       0.56      0.56      0.55       418





In [11]:
#if those ware predictable:
#y_filling = model.predict(X_missing)
#cleaned_df = pd.DataFrame(data= np.c_[y_filling, X_missing],columns = df.columns)

### Exercise 1.4

Perform Z-score normalization on every column (except the type of course!)

In [88]:
means = list(df_cleaned.loc[:,"length":"num_rings"].mean())
std   = list(df_cleaned.loc[:,"length":"num_rings"].std())

print("Mean ",df_cleaned.loc[:,"length":"num_rings"].mean())
print("Standard Dev",df_cleaned.loc[:,"length":"num_rings"].std())
cols = list(df_cleaned.columns)
cols.remove('type')

Mean  length            0.524325
width             0.408461
height            0.139622
total_weight      0.828155
meat_weight       0.358566
drained_weight    0.179994
shell_weight      0.238480
num_rings         9.899904
dtype: float64
Standard Dev length            0.118541
width             0.097864
height            0.041100
total_weight      0.483435
meat_weight       0.219277
drained_weight    0.107903
shell_weight      0.137423
num_rings         3.192062
dtype: float64


In [103]:
for col in cols:
    col_zscore = col + '_zscore'
    #forula for Z Score
    df_cleaned[col_zscore] = (df_cleaned[col] - df_cleaned[col].mean())/df_cleaned[col].std()
    
print(df_cleaned.head())


   type  length  width  height  total_weight  meat_weight  drained_weight  \
0   0.0   0.350  0.265   0.090        0.2255       0.0995          0.0485   
1   1.0   0.530  0.420   0.135        0.6770       0.2565          0.1415   
2   0.0   0.440  0.365   0.125        0.5160       0.2155          0.1140   
3   2.0   0.545  0.255   0.080        0.2050       0.0895          0.0395   
4   2.0   0.425  0.300   0.095        0.3515       0.1410          0.0775   

   shell_weight  num_rings  length_zscore  width_zscore  height_zscore  \
0         0.070        9.0      -1.470582     -1.465932      -1.207344   
1         0.210        9.0       0.047876      0.117904      -0.112449   
2         0.155       10.0      -0.711353     -0.444102      -0.355759   
3         0.055        7.0       0.174414     -1.568115      -1.450653   
4         0.120        8.0      -0.837891     -1.108292      -1.085689   

   total_weight_zscore  meat_weight_zscore  drained_weight_zscore  \
0            -1.246610 

## 2. Preprocessing text (Optional)

One possible way to transform text documents into vectors of numeric attributes is to use the TF-IDF representation. We will experiment with this representation using the 20 Newsgroup data set. The data set contains postings on 20 different topics. The classification problem is to decide which of the topics a posting falls into. Here, we will only consider postings about medicine and space.

In [91]:
from sklearn.datasets import fetch_20newsgroups
import math

categories = ['sci.med', 'sci.space']
raw_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
print('The index of each category is: {}'.format([(i,target) for i,target in enumerate(raw_data.target_names)]))

The index of each category is: [(0, 'sci.med'), (1, 'sci.space')]


Check out some of the postings, might find some funny ones!

In [92]:
import numpy as np
idx = np.random.randint(0, len(raw_data.data))
print ('This is a {} email.\n'.format(raw_data.target_names[raw_data.target[idx]]))
print ('There are {} emails.\n'.format(len(raw_data.data)))
print(raw_data.data[idx])

This is a sci.space email.

There are 1187 emails.

From: dbm0000@tm0006.lerc.nasa.gov (David B. Mckissock)
Subject: Washington Post Article on SSF Redesign
News-Software: VAX/VMS VNEWS 1.41    
Nntp-Posting-Host: tm0006.lerc.nasa.gov
Organization: NASA Lewis Research Center / Cleveland, Ohio
Lines: 52

"Space Station Redesign Leader Says Cost Goal May Be
Impossible"

Today (4/6) the Washington Post ran an article with the
headline shown above. The article starts with "A leader
of the NASA team in charge of redesigning the planned
space station said yesterday the job is tough and may
be impossible." O'Connor is quoted saying whether it is
possible to cut costs that much and still provide for
meaningful research "is a real question for me."
O'Connor said "everything is fair game," including
"dropping or curtailing existing contracts with the
aerospace industry, chopping management of the space
station program at some NASA facilities around the
country, working closely with the Russian s

Lets pick the first 10 postings from each category

In [93]:
idxs_med = np.flatnonzero(raw_data.target == 0)
idxs_space = np.flatnonzero(raw_data.target == 1)
idxs = np.concatenate([idxs_med[:10],idxs_space[:10]])
data = np.array(raw_data.data)
data = data[idxs]

<a href="http://www.nltk.org/">NLTK</a> is a toolkit for natural language processing. Take some time to install it and go through this <a href="http://www.slideshare.net/japerk/nltk-in-20-minutes">short tutorial/presentation</a>.

The downloaded package below is a tokenizer that divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.

In [94]:
import nltk
import itertools
nltk.download('punkt')

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in data]
vocabulary_size = 1000
unknown_token = 'unknown'

[nltk_data] Downloading package punkt to /home/pandoora/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [95]:
# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print ("Found %d unique words tokens." % len(word_freq.items()))

Found 1641 unique words tokens.


In [97]:
# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])
 
print ("Using vocabulary size %d." % vocabulary_size)
print ("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

Using vocabulary size 1000.
The least frequent word in our vocabulary is 'REASONS' and appeared 1 times.


In [98]:
import collections
od = collections.OrderedDict(sorted(word_to_index.items()))

### Exercise 2.1

Code your own TF-IDF representation function and use it on this dataset. (Don't use code from libraries. Build your own function with Numpy/Pandas). Use the formular TFIDF = TF * (IDF+1). The effect of adding “1” to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored.

In [99]:
from sklearn.feature_extraction.text import CountVectorizer
countvec = CountVectorizer()
df = pd.DataFrame(countvec.fit_transform(data).toarray(), columns=countvec.get_feature_names())


def tfidf(df):
    for col in list(df.columns):
        #formula for for idf
        df[col] = df[col] * (math.log(df.shape[0]/sum(df[col] != 0)) +1)    
    return df

rep = tfidf(df)

In [101]:
# Check if your implementation is correct
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(norm=None, smooth_idf=False, use_idf=True)
X_train = pd.DataFrame(vectorizer.fit_transform(data).toarray(), columns=countvec.get_feature_names())
answer=['No','Yes']
if rep is not None:
    print ('Is this implementation correct?\nAnswer: {}'.format(answer[1*np.all(X_train == rep)]))

Is this implementation correct?
Answer: Yes
