[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ignaziogallo/data-mining/blob/aa20-21/tutorials/data_types/Attributes_Types.ipynb)

# Types of Attributes

In [1]:
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

- Import Pandas in Python with the `pd` alias. 
- Use `.read_csv()` to read the dataset into a `DataFrame` object `iris`.

- You can see how much data `iris` contains:

In [2]:
print("Dataset shape (rows, columns):", iris.shape)

Dataset shape (rows, columns): (150, 5)


- Use the `.shape` attribute of the `DataFrame` to see its dimensionality.
- Now you know that there are 150 rows and 5 columns in your dataset.
- You can have a look at the last five rows with `.tail()`:

In [4]:
iris.tail()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


- `Pandas` assigns a data type to each column based on its values. 
- While it does a pretty good job, it’s not perfect. 
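
To make the "not perfect" point concrete, here is a minimal sketch on a small made-up CSV. The `name` and `height` columns and their values are hypothetical and not part of the iris data. A single stray text entry is enough for Pandas to store the whole column as text, and `pd.to_numeric` with `errors="coerce"` is one possible way to recover a numeric column.

```python
import io
import pandas as pd

# Hypothetical data: the third height was entered as text.
csv_text = "name,height\nanna,1.70\nbob,1.82\ncarl,unknown\n"
raw = pd.read_csv(io.StringIO(csv_text))

# One non-numeric entry makes Pandas store the whole column as text.
print(raw["height"].dtype)

# Coerce to numbers; entries that cannot be parsed become NaN.
raw["height_num"] = pd.to_numeric(raw["height"], errors="coerce")
print(raw["height_num"].dtype)
```

Checking `.info()` or `.dtypes` right after loading, as done next for `iris`, is a cheap way to spot this kind of problem.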

In [5]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


The `species` column contains text, which Pandas stores with the generic `object` data type.

The `species` column can have only three different values:

In [6]:
print("nunique:",iris["species"].nunique())
iris["species"].value_counts()

nunique: 3


virginica     50
versicolor    50
setosa        50
Name: species, dtype: int64

Pandas provides the **categorical** data type for columns like this one, whose values come from a small, fixed set:

In [7]:
iris["species"] = pd.Categorical(iris["species"])
iris["species"].dtype

CategoricalDtype(categories=['setosa', 'versicolor', 'virginica'], ordered=False)

Categorical data has a few advantages over unstructured text:
- Validation is easier, because only the declared categories are accepted.
- It can save a lot of memory, as Pandas stores each unique value once and refers to it internally with a small integer code.

Run `iris.info()` again. You should see that changing the `species` data type from `object` to categorical has decreased the memory usage.

In [8]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
species         150 non-null category
dtypes: category(1), float64(4)
memory usage: 5.0 KB
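
Both advantages can be checked directly. The sketch below uses a hypothetical Series of repeated species labels rather than the `iris` DataFrame, so that it runs on its own. `memory_usage(deep=True)` also counts the bytes of the stored strings, which `info()` does not do by default.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times.
species = pd.Series(["setosa", "versicolor", "virginica"] * 1000)
as_category = species.astype("category")

# deep=True includes the memory taken by the strings themselves.
text_bytes = species.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(text_bytes, category_bytes)

# Validation: a value outside the known categories is rejected.
try:
    as_category.iloc[0] = "unknown"
    rejected = False
except (TypeError, ValueError):
    rejected = True
print(rejected)
```

The exact byte counts depend on the Pandas version, so none are quoted here. The categorical version should be clearly smaller because the labels repeat many times.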


In [9]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


## Create a Mixed Dataset

In [10]:
import pandas as pd

df = pd.DataFrame([
['German', 'Female', 23, 'marketing'],
['German', 'Female', 25, 'economics'],
['Belgian', 'Male', 18, 'strategy'],
['Italian', 'Female', 24, 'strategy'],
['Italian', 'Male', 48, 'economics'],
])
df.columns = ['nationality', 'gender', 'age', 'major']

In [12]:
df.head()

Unnamed: 0,nationality,gender,age,major
0,German,Female,23,marketing
1,German,Female,25,economics
2,Belgian,Male,18,strategy
3,Italian,Female,24,strategy
4,Italian,Male,48,economics


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
nationality    5 non-null object
gender         5 non-null object
age            5 non-null int64
major          5 non-null object
dtypes: int64(1), object(3)
memory usage: 240.0+ bytes


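We can transform the text attributes into **categorical** ones with `pd.Categorical`, as we did for `species`.

The Pandas `get_dummies()` function, used in the next section, takes such columns and returns a DataFrame in which they are expressed as dummy variables.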

In [14]:
df["nationality"] = pd.Categorical(df["nationality"])
df["gender"] = pd.Categorical(df["gender"])
df["major"] = pd.Categorical(df["major"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
nationality    5 non-null category
gender         5 non-null category
age            5 non-null int64
major          5 non-null category
dtypes: category(3), int64(1)
memory usage: 439.0 bytes

Note that the reported memory usage went up here rather than down. With only five rows, the fixed overhead of each categorical column outweighs the saving, and the `+` in the earlier `240.0+ bytes` means that the strings held in the `object` columns were not counted.


## One-hot encoding

One-hot encoding turns your categorical data into a binary vector representation. 

This means that for each unique value in a column, a new column is created. The values in this column are represented as 1s and 0s, depending on whether the value matches the column header.

In [15]:
pd.get_dummies(df[['nationality', 'gender']])

Unnamed: 0,nationality_Belgian,nationality_German,nationality_Italian,gender_Female,gender_Male
0,0,1,0,1,0
1,0,1,0,1,0
2,1,0,0,0,1
3,0,0,1,1,0
4,0,0,1,0,1


Now, we need to merge this new DataFrame back into the previous DataFrame.

In [16]:
categorical_columns = ['nationality', 'gender']
for column in categorical_columns:
    tempdf = pd.get_dummies(df[column], prefix=column)
    df = pd.merge(
        left=df,
        right=tempdf,
        left_index=True,
        right_index=True,
    )
    df = df.drop(columns=column)

In [17]:
df.head()

Unnamed: 0,age,major,nationality_Belgian,nationality_German,nationality_Italian,gender_Female,gender_Male
0,23,marketing,0,1,0,1,0
1,25,economics,0,1,0,1,0
2,18,strategy,1,0,0,0,1
3,24,strategy,0,0,1,1,0
4,48,economics,0,0,1,0,1
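
As an alternative to the loop above, `get_dummies()` can encode several columns and drop the originals in a single call through its `columns` parameter. This is a minimal sketch. It rebuilds the same toy data under the hypothetical name `people` so that it runs independently of the cells above.

```python
import pandas as pd

# Same toy data as above, rebuilt so the sketch runs on its own.
people = pd.DataFrame(
    [
        ["German", "Female", 23, "marketing"],
        ["German", "Female", 25, "economics"],
        ["Belgian", "Male", 18, "strategy"],
        ["Italian", "Female", 24, "strategy"],
        ["Italian", "Male", 48, "economics"],
    ],
    columns=["nationality", "gender", "age", "major"],
)

# columns=... encodes the listed columns and removes the originals.
# dtype=int requests 0/1 integers; recent Pandas versions return True/False by default.
encoded = pd.get_dummies(people, columns=["nationality", "gender"], dtype=int)
print(encoded)
```

The explicit merge loop is still useful when each column needs different handling. For a plain one-hot encoding, the single call is shorter.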


## Binning
* We can map the numeric column `age` into categories with Pandas `cut`.
* Let's group each person into a category based on their age.
* We need to define the category boundaries before mapping.

Let's see how to map each person to one of these categories:

In [18]:
bins = [15, 20, 25, 50]
category = ['junior', 'mid', 'senior']
df['experience'] = pd.cut(df['age'], bins, labels=category)
df.head()

Unnamed: 0,age,major,nationality_Belgian,nationality_German,nationality_Italian,gender_Female,gender_Male,experience
0,23,marketing,0,1,0,1,0,mid
1,25,economics,0,1,0,1,0,mid
2,18,strategy,1,0,0,0,1,junior
3,24,strategy,0,0,1,1,0,mid
4,48,economics,0,0,1,0,1,senior


We can count how many people fall into each category and, if we want, change the bin boundaries:

In [19]:
df['experience'].value_counts()

mid       3
senior    1
junior    1
Name: experience, dtype: int64
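
When changing the bins, it helps to know how `cut` treats the boundaries. By default each interval includes its right edge, so with `bins = [15, 20, 25, 50]` an age of 25 falls into `(20, 25]`. Passing `right=False` includes the left edge instead. The sketch below uses a standalone Series named `ages` (a hypothetical name) holding the same five ages.

```python
import pandas as pd

ages = pd.Series([23, 25, 18, 24, 48])
bins = [15, 20, 25, 50]
labels = ["junior", "mid", "senior"]

# Default: right-closed intervals (15, 20], (20, 25], (25, 50]
right_closed = pd.cut(ages, bins, labels=labels)

# right=False: left-closed intervals [15, 20), [20, 25), [25, 50)
left_closed = pd.cut(ages, bins, labels=labels, right=False)

print(right_closed.tolist())  # ['mid', 'mid', 'junior', 'mid', 'senior']
print(left_closed.tolist())   # ['mid', 'senior', 'junior', 'mid', 'senior']
```

Only the person aged exactly 25 changes category, which shows why the boundary convention matters when ages sit on a bin edge.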

## Load a set of items
### Market-basket data

The [GroceryStore DataSet](https://www.kaggle.com/shazadudwadia/supermarket#GroceryStoreDataSet.csv) is a small dataset of breakfast items bought at a store. Each row is one transaction.

In [20]:
groceries = pd.read_csv("https://raw.githubusercontent.com/sbkaracan/association_rule_groceryDataset/master/GroceryStoreDataSet.csv", names = ['products'], sep = ',')

print(groceries.shape)
groceries.head()

(20, 1)


Unnamed: 0,products
0,"MILK,BREAD,BISCUIT"
1,"BREAD,MILK,BISCUIT,CORNFLAKES"
2,"BREAD,TEA,BOURNVITA"
3,"JAM,MAGGI,BREAD,MILK"
4,"MAGGI,TEA,BISCUIT"


* Each row holds a whole transaction as one comma-separated string, so we need to split it into individual products. The row index acts as the transaction id.
* Let's split the products and create a list of lists called `data`:

In [21]:
data = list(groceries["products"].apply(lambda x:x.split(",") ))
print(data)

[['MILK', 'BREAD', 'BISCUIT'], ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'], ['BREAD', 'TEA', 'BOURNVITA'], ['JAM', 'MAGGI', 'BREAD', 'MILK'], ['MAGGI', 'TEA', 'BISCUIT'], ['BREAD', 'TEA', 'BOURNVITA'], ['MAGGI', 'TEA', 'CORNFLAKES'], ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'], ['JAM', 'MAGGI', 'BREAD', 'TEA'], ['BREAD', 'MILK'], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'SUGER', 'BOURNVITA'], ['BREAD', 'COFFEE', 'COCK'], ['BREAD', 'SUGER', 'BISCUIT'], ['COFFEE', 'SUGER', 'CORNFLAKES'], ['BREAD', 'SUGER', 'BOURNVITA'], ['BREAD', 'COFFEE', 'SUGER'], ['BREAD', 'COFFEE', 'SUGER'], ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]


* Using `TransactionEncoder`, we convert the list into a one-hot encoded Boolean table.
* Products that customers bought or did not buy during shopping are now represented by `True` and `False` (equivalent to 1 and 0).

In [22]:
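# Transform the list of transactions with one-hot encoding
from mlxtend.preprocessing import TransactionEncoder

encoder = TransactionEncoder()
encoded = encoder.fit(data).transform(data)  # Boolean array, one column per product

# Keep the Boolean values shown below; use df.astype(int) for a 0/1 table instead
df = pd.DataFrame(encoded, columns=encoder.columns_)
df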

Unnamed: 0,BISCUIT,BOURNVITA,BREAD,COCK,COFFEE,CORNFLAKES,JAM,MAGGI,MILK,SUGER,TEA
0,True,False,True,False,False,False,False,False,True,False,False
1,True,False,True,False,False,True,False,False,True,False,False
2,False,True,True,False,False,False,False,False,False,False,True
3,False,False,True,False,False,False,True,True,True,False,False
4,True,False,False,False,False,False,False,True,False,False,True
5,False,True,True,False,False,False,False,False,False,False,True
6,False,False,False,False,False,True,False,True,False,False,True
7,True,False,True,False,False,False,False,True,False,False,True
8,False,False,True,False,False,False,True,True,False,False,True
9,False,False,True,False,False,False,False,False,True,False,False
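
To see what this encoding amounts to, here is a plain Python and Pandas approximation that does not use `mlxtend`. It is a sketch of the idea, not the library's implementation. The names `transactions`, `items` and `onehot` are illustrative, and only three of the baskets are included.

```python
import pandas as pd

# A few transactions in the same list-of-lists shape as `data`.
transactions = [
    ["MILK", "BREAD", "BISCUIT"],
    ["BREAD", "TEA", "BOURNVITA"],
    ["MAGGI", "TEA", "BISCUIT"],
]

# One column per distinct item, in alphabetical order.
items = sorted({item for basket in transactions for item in basket})

# One row per transaction: True where the item is in the basket.
rows = [[item in basket for item in items] for basket in transactions]
onehot = pd.DataFrame(rows, columns=items)

# Cast to int for a 0/1 view of the same table.
print(onehot.astype(int))
```

Each row has as many `True` entries as the basket has distinct products, so the table is mostly `False` when the store sells many items.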


### Text documents as bag-of-words

We can use the `CountVectorizer` class from the scikit-learn library to build a bag-of-words (`BoW`) model in Python. Each document becomes a row of word counts over a vocabulary shared by all documents.

In [24]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentence_1 = "This is a good job. I will not miss it for anything"
sentence_2 = "This is not good at all all"

count_vec = CountVectorizer()

# Transform the documents into a BoW count matrix
count_data = count_vec.fit_transform([sentence_1, sentence_2])

# Create a DataFrame with one column per vocabulary word
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
cv_dataframe = pd.DataFrame(count_data.toarray(), columns=count_vec.get_feature_names_out())
cv_dataframe

Unnamed: 0,all,anything,at,for,good,is,it,job,miss,not,this,will
0,0,1,0,1,1,1,1,1,1,1,1,1
1,2,0,1,0,1,1,0,0,0,1,1,0
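
The table above can be reproduced with the standard library alone, which makes the counting explicit. This is a sketch that mimics the default settings of `CountVectorizer` (lowercasing, tokens of at least two word characters), not its actual implementation. The helper `tokenize` and the other names are hypothetical.

```python
import re
from collections import Counter

docs = [
    "This is a good job. I will not miss it for anything",
    "This is not good at all all",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # so single-letter words such as "a" and "I" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Count the tokens of each document separately.
counts = [Counter(tokenize(doc)) for doc in docs]

# Shared vocabulary in alphabetical order, then one count row per document.
vocabulary = sorted(set().union(*counts))
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded entirely: only how often each word occurs in a document is kept, which is why "all all" shows up as a count of 2.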
