In [1]:
import pandas as pd

In [2]:
# Load the dataset into a Pandas DataFrame
# Be sure to use the path to where the CSV is stored
file_path = "data/iris.csv"
iris_df = pd.read_csv(file_path)
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [4]:
#drop the class field:
new_iris_df = iris_df.drop(['class'], axis=1)
new_iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [7]:
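# Alternative to drop(): select (and reorder) the feature columns explicitly, leaving out the class column
new_iris_df = iris_df[['sepal_length', 'petal_length', 'sepal_width', 'petal_width']]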
new_iris_df.head()

Unnamed: 0,sepal_length,petal_length,sepal_width,petal_width
0,5.1,1.4,3.5,0.2
1,4.9,1.4,3.0,0.2
2,4.7,1.3,3.2,0.2
3,4.6,1.5,3.1,0.2
4,5.0,1.4,3.6,0.2


Finally, the preprocessed DataFrame is saved to a new CSV file for future use. This is done by storing the output file path (directory and file name) in a variable, then passing it to the Pandas to_csv() method. Setting index=False keeps the row index from being written as an extra column, as shown below:

In [8]:
output_file_path = "data/new_iris_data.csv"
new_iris_df.to_csv(output_file_path, index=False)
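The effect of `index=False` can be seen with a small round trip. The following is a minimal sketch that writes to in-memory buffers instead of disk; `demo_df` and its values are stand-ins invented for illustration, not the real iris data.

```python
import io

import pandas as pd

# Small stand-in DataFrame (illustrative values, not the real iris data)
demo_df = pd.DataFrame({"sepal_length": [5.1, 4.9], "petal_length": [1.4, 1.4]})

# Default behaviour: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'sepal_length', 'petal_length']

# index=False: only the real columns are written
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['sepal_length', 'petal_length']
```

Without `index=False`, anyone reloading the file gets an `Unnamed: 0` column that would need to be dropped again before modeling.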

## 18.2.3_Preprocessing Data with Pandas

In [9]:
# Data loading
file_path = "data/shopping_data.csv"
df_shopping = pd.read_csv(file_path, encoding="ISO-8859-1")
df_shopping.head()

Unnamed: 0,CustomerID,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,Yes,19.0,15000,39.0
1,2,Yes,21.0,15000,81.0
2,3,No,20.0,16000,6.0
3,4,No,23.0,16000,77.0
4,5,No,31.0,17000,40.0


### What knowledge do we hope to glean from running an unsupervised learning model on this dataset?

It's a shopping dataset, so we can group together shoppers based on spending habits.

### What data is available?

In [10]:
# columns
df_shopping.columns

Index(['CustomerID', 'Card Member', 'Age', 'Annual Income',
       'Spending Score (1-100)'],
      dtype='object')

### What type of data is available?

In [11]:
#all columns we plan to use in our model MUST contain a numerical data type.

#list dataframe data types
df_shopping.dtypes

CustomerID                  int64
Card Member                object
Age                       float64
Annual Income               int64
Spending Score (1-100)    float64
dtype: object
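Rather than reading the dtype listing by eye, the non-numeric columns can be picked out programmatically. This is a minimal sketch using `select_dtypes()`; `demo_df` is a stand-in that reuses the column names above with illustrative values.

```python
import pandas as pd

# Stand-in for df_shopping with the same column names (values are illustrative)
demo_df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Card Member": ["Yes", "No", "Yes"],
    "Age": [19.0, 21.0, 20.0],
    "Annual Income": [15000, 15000, 16000],
    "Spending Score (1-100)": [39.0, 81.0, 6.0],
})

# Columns whose dtype is not numeric need to be converted or removed before modeling
non_numeric_cols = demo_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)  # ['Card Member']
```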

### What data is missing?
Next, let's see if any data is missing. Unsupervised learning models can't handle missing data. If you try to run a model on a dataset with missing data, you'll get an error such as the one below:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In [14]:
# Pandas has the isnull() method to check for missing values. We'll loop through each column,
# count its null values, and print a readable total:

#find all null values
for column in df_shopping.columns:
    print(f"Column {column} has {df_shopping[column].isnull().sum()} null values")

Column CustomerID has 0 null values
Column Card Member has 2 null values
Column Age has 2 null values
Column Annual Income has 0 null values
Column Spending Score (1-100) has 1 null values
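The same counts can be obtained without an explicit loop, since `isnull().sum()` works on a whole DataFrame. A minimal sketch, again on an illustrative stand-in DataFrame:

```python
import numpy as np
import pandas as pd

# Stand-in data with a few missing values (illustrative only)
demo_df = pd.DataFrame({
    "Card Member": ["Yes", None, "No"],
    "Age": [19.0, np.nan, 20.0],
    "Annual Income": [15000, 15000, 16000],
})

# Null count per column, returned as a Series indexed by column name
null_counts = demo_df.isnull().sum()
print(null_counts)

# Number of rows that contain at least one null value
rows_with_nulls = demo_df.isnull().any(axis=1).sum()
print(f"Rows with at least one null: {rows_with_nulls}")
```

The per-row count is useful for judging how much data a later `dropna()` call would discard.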


### What data can be removed?
Consider:

- Are there string columns that we can't use?
- Are there columns with excessive null data points?
- Did we decide to handle missing values simply by removing them?

In [15]:
#rows of data with null values can be removed with the dropna() method

#drop null rows
df_shopping = df_shopping.dropna()
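Dropping rows is only one way to handle missing values. As a hedged illustration of the trade-off, the sketch below compares `dropna()` with filling numeric gaps using each column's median via `fillna()`. Median imputation is an assumption chosen for this example, not the approach taken in this notebook, and `demo_df` holds illustrative values.

```python
import numpy as np
import pandas as pd

# Stand-in numeric data with gaps (illustrative only)
demo_df = pd.DataFrame({
    "Age": [19.0, np.nan, 20.0, 23.0],
    "Spending Score (1-100)": [39.0, 81.0, np.nan, 77.0],
})

# Option 1: remove every row that has at least one null value
dropped = demo_df.dropna()
print(f"Rows kept after dropna(): {len(dropped)} of {len(demo_df)}")

# Option 2 (assumed alternative): fill each gap with that column's median
filled = demo_df.fillna(demo_df.median())
print(filled)
```

With only a handful of incomplete rows, as in the shopping data, dropping them loses little; imputation becomes more attractive when many rows would otherwise be discarded.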

In [16]:
# Use duplicated().sum() to count rows that are exact duplicates of an earlier row

#find duplicate entries
print(f"Duplicate entries: {df_shopping.duplicated().sum()}")

Duplicate entries: 0
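One caveat worth keeping in mind: while a unique ID column is still present, no two rows can be identical, so a full-row check will always report zero. The sketch below, on illustrative stand-in data, shows how the `subset` parameter restricts the comparison to the feature columns and how `drop_duplicates()` would remove any matches.

```python
import pandas as pd

# Stand-in data: rows 0 and 2 match on every column except the ID (illustrative only)
demo_df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Age": [19.0, 21.0, 19.0],
    "Annual Income": [15000, 15000, 15000],
})

# Full-row comparison: the unique ID makes every row distinct
full_row_dupes = demo_df.duplicated().sum()
print(f"Full-row duplicates: {full_row_dupes}")

# Compare only the feature columns
feature_cols = ["Age", "Annual Income"]
feature_dupes = demo_df.duplicated(subset=feature_cols).sum()
print(f"Duplicates on feature columns: {feature_dupes}")

# Keep the first occurrence of each feature combination
deduped = demo_df.drop_duplicates(subset=feature_cols).reset_index(drop=True)
print(deduped)
```

Whether feature-level duplicates should be removed depends on the data: two different customers can legitimately share the same age and income.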


In [17]:
#remove the columns that do not offer insight into customer shopping habits (or whatever data you are exploring)

#remove the CustomerID Column
df_shopping.drop(columns=["CustomerID"], inplace=True)
df_shopping.head()

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,Yes,19.0,15000,39.0
1,Yes,21.0,15000,81.0
2,No,20.0,16000,6.0
3,No,23.0,16000,77.0
4,No,31.0,17000,40.0


## 18.2.5_Data Processing

For data processing, the focus is on making sure the data is set up for the unsupervised learning model, which requires the following:

- Null values are handled.
- Only numerical data is used.
- Values are scaled. In other words, the columns have been brought to comparable ranges so that a feature with large values (such as annual income) doesn't dominate the results.
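As a rough illustration of the scaling requirement, the sketch below standardizes each column to zero mean and unit variance using plain Pandas arithmetic. This is an assumed technique for the example (the notebook itself only divides income by 1,000), and `demo_df` holds illustrative values.

```python
import pandas as pd

# Stand-in numeric data on very different scales (illustrative only)
demo_df = pd.DataFrame({
    "Age": [19.0, 21.0, 20.0, 23.0],
    "Annual Income": [15000, 15000, 16000, 16000],
})

# Standardize: subtract each column's mean, divide by its population standard deviation
scaled_df = (demo_df - demo_df.mean()) / demo_df.std(ddof=0)
print(scaled_df)

# After scaling, every column has mean ~0 and standard deviation ~1
print(scaled_df.mean().round(6))
print(scaled_df.std(ddof=0).round(6))
```

scikit-learn's `StandardScaler` performs the same calculation and is the more common choice when the data feeds directly into a scikit-learn model.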

### Is the data in a format that can be passed into an unsupervised learning model?

In [19]:
# Transform the string column: "Yes" becomes 1, any other value becomes 0
def change_string(member):
    if member == "Yes":
        return 1
    else:
        return 0
    
df_shopping["Card Member"] = df_shopping["Card Member"].apply(change_string)
df_shopping.head()

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,19.0,15000,39.0
1,1,21.0,15000,81.0
2,0,20.0,16000,6.0
3,0,23.0,16000,77.0
4,0,31.0,17000,40.0
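An alternative to a custom function is `Series.map()` with a dictionary. The sketch below is a hedged variation rather than the notebook's method; the "Maybe" value is invented to show one behavioral difference: unmapped values become NaN instead of being silently encoded as 0.

```python
import pandas as pd

# Stand-in column with one unexpected value (illustrative only)
card_member = pd.Series(["Yes", "No", "Maybe", "Yes"], name="Card Member")

# Explicit mapping: values missing from the dictionary become NaN
encoded = card_member.map({"Yes": 1, "No": 0})
print(encoded)

# Unexpected categories surface as nulls that can be inspected
unmapped_count = encoded.isnull().sum()
print(f"Unmapped values: {unmapped_count}")
```

If the column is known to contain only "Yes" and "No", both approaches give the same result; the dictionary version just makes unexpected categories visible.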


In [20]:
# Transform annual income: express it in thousands so its scale is closer to the other columns
df_shopping["Annual Income"] = df_shopping["Annual Income"] / 1000
df_shopping.head()

Unnamed: 0,Card Member,Age,Annual Income,Spending Score (1-100)
0,1,19.0,15.0,39.0
1,1,21.0,15.0,81.0
2,0,20.0,16.0,6.0
3,0,23.0,16.0,77.0
4,0,31.0,17.0,40.0


## 18.2.6_Data Transformation
Data transformation means converting the cleaned data into a format that is convenient for others to use in the future.

### Can I quickly hand off this data for others to use?

In [21]:
#convert final product into a CSV or Excel file.

#saving cleaned data
file_path = "data/shopping_data_cleaned.csv"
df_shopping.to_csv(file_path, index=False)
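Before handing the file off, it can help to confirm that the data meets the requirements listed earlier. The sketch below defines `is_model_ready`, a hypothetical helper written for this example (not part of Pandas), which checks that a DataFrame has no nulls and only numeric columns.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def is_model_ready(df):
    """Hypothetical helper: True if df has no null values and only numeric columns."""
    has_nulls = bool(df.isnull().any().any())
    all_numeric = all(is_numeric_dtype(df[col]) for col in df.columns)
    return (not has_nulls) and all_numeric


# Stand-in DataFrames (illustrative values only)
clean_df = pd.DataFrame({"Card Member": [1, 0], "Age": [19.0, 21.0]})
raw_df = pd.DataFrame({"Card Member": ["Yes", "No"], "Age": [19.0, None]})

print(is_model_ready(clean_df))  # True
print(is_model_ready(raw_df))    # False
```

This check does not cover scaling, which depends on the model being used.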