## Dimensionality Reduction Using Factor Analysis

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
import time
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
adData = pd.read_csv('ad_data.csv', sep=',', header=None, error_bad_lines=False)
adData.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1549,1550,1551,1552,1553,1554,1555,1556,1557,1558
0,125,125,1.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
1,57,468,8.2105,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
2,33,230,6.9696,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
3,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.
4,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,ad.


The arguments to pd.read_csv() are the filename as a string and the delimiter of the CSV file, which is ",". Because the dataset has no header row, we state this explicitly with header=None. The last argument, **error_bad_lines=False**, tells pandas to skip malformed lines instead of raising an error while loading the data. Note that this argument was deprecated in pandas 1.3 in favor of **on_bad_lines='skip'** and removed in pandas 2.0.
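
For readers on a recent pandas release, where `error_bad_lines` is no longer accepted, the following is a minimal sketch of the same idea using `on_bad_lines='skip'`. The in-memory CSV text is made up for illustration and is not part of ad_data.csv.

```python
import io
import pandas as pd

# Hypothetical three-line CSV; the second line has one field too many.
raw = "1,2,3\n4,5,6,7\n8,9,10\n"

# header=None: the file has no header row.
# on_bad_lines='skip': drop malformed lines instead of raising a parser error.
df = pd.read_csv(io.StringIO(raw), sep=",", header=None, on_bad_lines="skip")
print(df.shape)
print(df)
```

Only the two well-formed lines should survive, so it is worth comparing the row count against the raw file when using this option on real data.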

In [3]:
# dataset shape
print(adData.shape)

(3279, 1559)


In [4]:
# summarizing the statistics of the numerical raw data
adData.describe()

Unnamed: 0,4,5,6,7,8,9,10,11,12,13,...,1548,1549,1550,1551,1552,1553,1554,1555,1556,1557
count,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0,...,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0,3279.0
mean,0.00427,0.011589,0.004575,0.003355,0.003965,0.011589,0.003355,0.00488,0.009149,0.004575,...,0.006099,0.004575,0.00366,0.00244,0.00305,0.006404,0.012809,0.013419,0.009759,0.001525
std,0.065212,0.107042,0.067491,0.057831,0.06285,0.107042,0.057831,0.069694,0.095227,0.067491,...,0.077872,0.067491,0.060393,0.049341,0.055148,0.079783,0.112466,0.115077,0.09832,0.039026
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


As the shape shows, the dataset has 3279 examples and 1559 variables. The variables are a mix of categorical and numerical ones. describe() derives the summary statistics only for the numerical columns, which is why the summary above starts at column 4.
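
The behaviour of describe() on mixed data can be checked on a small frame. This is a sketch on made-up data; the column names `width` and `label` are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with one numerical and one object (string) column.
toy = pd.DataFrame({"width": [125, 57, 33], "label": ["ad.", "nonad.", "ad."]})

# By default, only numerical columns are summarized.
numeric_summary = toy.describe()

# include="all" also summarizes object columns (count, unique, top, freq).
full_summary = toy.describe(include="all")

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```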

In [5]:
# separate dependent and independent variables
# preparing X variables
X = adData.loc[:,0:1557].copy() # columns 0-1557 are the features; .copy() avoids modifying a view of adData
print(X.shape)

# preparing y variable
y = adData[1558]
print(y.shape)

(3279, 1558)
(3279,)


In [6]:
# head of independent variables
X.head(15)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1548,1549,1550,1551,1552,1553,1554,1555,1556,1557
0,125,125,1.0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,57,468,8.2105,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,33,230,6.9696,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,59,460,7.7966,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,60,234,3.9,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,60,468,7.8,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The dataset contains many missing values, which are represented by **?**. Before further analysis, we have to replace these special characters with assumed values. One popular method is to impute the mean of the respective feature, and this is the strategy we will adopt. Before doing that, however, let's look at the data types in the dataset so that we can choose a suitable way of replacing the characters.

In [7]:
# printing data types of dataset
print(X.dtypes)

0       object
1       object
2       object
3       object
4        int64
         ...  
1553     int64
1554     int64
1555     int64
1556     int64
1557     int64
Length: 1558, dtype: object


In [8]:
# replacing the special character '?' with NaN in the first three columns (0-2), which are of type object
for i in range(0,3):
    X[i] = X[i].str.replace('?', 'nan', regex=False).values.astype(float)
print(X.head(15))

     0      1       2    3     4     5     6     7     8     9     ...  1548  \
0   125.0  125.0  1.0000    1     0     0     0     0     0     0  ...     0   
1    57.0  468.0  8.2105    1     0     0     0     0     0     0  ...     0   
2    33.0  230.0  6.9696    1     0     0     0     0     0     0  ...     0   
3    60.0  468.0  7.8000    1     0     0     0     0     0     0  ...     0   
4    60.0  468.0  7.8000    1     0     0     0     0     0     0  ...     0   
5    60.0  468.0  7.8000    1     0     0     0     0     0     0  ...     0   
6    59.0  460.0  7.7966    1     0     0     0     0     0     0  ...     0   
7    60.0  234.0  3.9000    1     0     0     0     0     0     0  ...     0   
8    60.0  468.0  7.8000    1     0     0     0     0     0     0  ...     0   
9    60.0  468.0  7.8000    1     0     0     0     0     0     0  ...     0   
10    NaN    NaN     NaN    1     0     0     0     0     0     0  ...     0   
11   90.0   52.0  0.5777    1     0     

To process the first three columns, we loop over them with a **for** loop and the **range()** function. Since these columns are of the **object** (string) type, we use the **.str.replace()** function, which stands for "string replace". After replacing the special character, **?**, with nan, we convert the data to **float** with **.values.astype(float)**, which is required for further processing. By printing the first 15 examples, we can see that all special characters have been replaced with **NaN** values.
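
The replacement step can be tried on a small frame first. The sketch below uses made-up values and also shows `pd.to_numeric(errors="coerce")` as an alternative that reaches the same result; the alternative is not what the notebook uses.

```python
import pandas as pd

# Toy object-typed columns with '?' marking missing values (made-up data).
toy = pd.DataFrame({0: ["125", "?", "90"], 1: ["125", "?", "52"]})

# Approach used in this notebook: literal string replace, then cast to float.
cleaned = toy.copy()
for col in cleaned.columns:
    # regex=False treats '?' as a literal character, not a regex quantifier.
    cleaned[col] = cleaned[col].str.replace("?", "nan", regex=False).astype(float)

# Alternative: coerce anything that is not a number to NaN in one step.
coerced = toy.apply(pd.to_numeric, errors="coerce")

print(cleaned.dtypes)
print(cleaned.isna().sum())
```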

In [9]:
# replacing special characters in the remaining columns (column 3 is of type object, the rest are integers)
for i in range(3, 1558):  # 1558 so that the last feature column, 1557, is included
    X[i] = X[i].replace('?', 'NaN').values.astype(float)

Now that we have replaced special characters in the data with NaN values, we can use the fillna() function in pandas to replace the NaN values with the mean of the column.
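
As a small illustration of mean imputation with fillna(), the sketch below fills every column of a toy frame in one call instead of looping. The column names and values are made up.

```python
import numpy as np
import pandas as pd

# Toy numerical data with one missing value per column.
toy = pd.DataFrame({
    "height": [60.0, np.nan, 90.0],
    "width": [468.0, 52.0, np.nan],
})

# toy.mean() is a Series indexed by column name (NaN values are ignored);
# fillna() then fills each column with its own mean.
imputed = toy.fillna(toy.mean())
print(imputed)
```

Here the missing height becomes 75.0 (the mean of 60 and 90) and the missing width becomes 260.0 (the mean of 468 and 52).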

In [10]:
# impute the NaN values with the mean of each column
for i in range(0, 1558):  # covers all feature columns, 0 to 1557
    X[i] = X[i].fillna(X[i].mean())
print(X.head(15))

          0           1         2     3     4     5     6     7     8     \
0   125.000000  125.000000  1.000000   1.0   0.0   0.0   0.0   0.0   0.0   
1    57.000000  468.000000  8.210500   1.0   0.0   0.0   0.0   0.0   0.0   
2    33.000000  230.000000  6.969600   1.0   0.0   0.0   0.0   0.0   0.0   
3    60.000000  468.000000  7.800000   1.0   0.0   0.0   0.0   0.0   0.0   
4    60.000000  468.000000  7.800000   1.0   0.0   0.0   0.0   0.0   0.0   
5    60.000000  468.000000  7.800000   1.0   0.0   0.0   0.0   0.0   0.0   
6    59.000000  460.000000  7.796600   1.0   0.0   0.0   0.0   0.0   0.0   
7    60.000000  234.000000  3.900000   1.0   0.0   0.0   0.0   0.0   0.0   
8    60.000000  468.000000  7.800000   1.0   0.0   0.0   0.0   0.0   0.0   
9    60.000000  468.000000  7.800000   1.0   0.0   0.0   0.0   0.0   0.0   
10   64.021886  155.344828  3.911953   1.0   0.0   0.0   0.0   0.0   0.0   
11   90.000000   52.000000  0.577700   1.0   0.0   0.0   0.0   0.0   0.0   
12   90.0000

In [11]:
# scale data using MinMaxScaler; scaling data is useful in the modeling step
from sklearn.preprocessing import MinMaxScaler
minmaxScaler = MinMaxScaler()

# transforming with the scaler function
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))
print(X_tran.head())

       0         1         2     3     4     5     6     7     8     9     \
0  0.194053  0.194053  0.016642   1.0   0.0   0.0   0.0   0.0   0.0   0.0   
1  0.087637  0.730829  0.136820   1.0   0.0   0.0   0.0   0.0   0.0   0.0   
2  0.050078  0.358372  0.116138   1.0   0.0   0.0   0.0   0.0   0.0   0.0   
3  0.092332  0.730829  0.129978   1.0   0.0   0.0   0.0   0.0   0.0   0.0   
4  0.092332  0.730829  0.129978   1.0   0.0   0.0   0.0   0.0   0.0   0.0   

   ...  1548  1549  1550  1551  1552  1553  1554  1555  1556  1557  
0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
1  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
2  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
3  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  
4  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  

[5 rows x 1558 columns]


In [12]:
# creating a high-dimensional dataset by repeating the features 50 times column-wise
X_hd = pd.DataFrame(np.tile(X_tran, (1, 50)))
print(X_hd.shape)

(3279, 77900)


In [13]:
# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
                                    X_hd, y, test_size=0.3, random_state=123)

An important step in factor analysis is defining the number of factors in a dataset. This step is achieved through experimentation. In our case, we will arbitrarily assume that there are 20 factors.
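
One way to run such an experiment, sketched here on synthetic data rather than the ad dataset, is to compare candidate numbers of factors by their cross-validated log-likelihood, which FactorAnalysis exposes through its score() method. The data, the candidate list, and all variable names below are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(123)

# Synthetic data: 10 observed features generated from 3 latent factors plus noise.
latent = rng.normal(size=(300, 3))
loadings = rng.normal(size=(3, 10))
X_toy = latent @ loadings + 0.5 * rng.normal(size=(300, 10))

candidates = [1, 2, 3, 4, 5, 6]
scores = []
for k in candidates:
    fa_k = FactorAnalysis(n_components=k, random_state=123)
    # cross_val_score falls back to the estimator's score() method, which for
    # FactorAnalysis is the average log-likelihood of the held-out samples.
    scores.append(np.mean(cross_val_score(fa_k, X_toy)))

best_k = candidates[int(np.argmax(scores))]
print(dict(zip(candidates, np.round(scores, 3))))
print("best number of factors by held-out log-likelihood:", best_k)
```

On data as wide as the 77900-column set used here, each fit is slow, so a search like this would likely need a coarse candidate grid.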

In [14]:
from sklearn.decomposition import FactorAnalysis

In [15]:
# defining the number of factors
fa = FactorAnalysis(n_components=20, random_state=123)

Once the factor analysis object is defined, we fit it on the training set and transform the training set to get a new training set with the required number of factors. We also note the time it takes to fit the required number of factors.

In [16]:
# fitting factor analysis method and transforming training set
t0 = time.time()
X_fac = fa.fit_transform(X_train)
t1 = time.time()
print('Factor analysis fitting time: {}s'.format(
        round(t1-t0, 3)))

Factor analysis fitting time: 32.864s


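In the code, the fit_transform() method fits the factor analysis model on the training set and, in the same call, transforms the training set into a new one with the required number of factors. The test set is then transformed with the already fitted model using transform().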
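
The relationship between fitting and transforming can be checked on a small random dataset. This is a sketch under the assumption that both estimators are given the same random_state; the toy arrays and names are made up.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.RandomState(0)
X_toy_train = rng.normal(size=(100, 8))
X_toy_test = rng.normal(size=(20, 8))

# One call: fit on the training data and return the transformed training data.
fa_a = FactorAnalysis(n_components=3, random_state=123)
Z_a = fa_a.fit_transform(X_toy_train)

# Two calls: fit first, then transform the same data.
fa_b = FactorAnalysis(n_components=3, random_state=123)
fa_b.fit(X_toy_train)
Z_b = fa_b.transform(X_toy_train)

same = np.allclose(Z_a, Z_b)
print("fit_transform matches fit + transform:", same)

# The test set is only transformed, never used for fitting.
Z_test = fa_a.transform(X_toy_test)
print(Z_a.shape, Z_test.shape)
```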

In [17]:
# transforming the test set
X_test_fac = fa.transform(X_test)

In [18]:
# shape of train and test sets before/after transformation
print('original shape of Training set: {}'.format(X_train.shape))
print('original shape of Testing set: {}'.format(X_test.shape))
print('transformed shape of Training set: {}'.format(X_fac.shape))
print('transformed shape of Testing set: {}'.format(X_test_fac.shape))

original shape of Training set: (2295, 77900)
original shape of Testing set: (984, 77900)
transformed shape of Training set: (2295, 20)
transformed shape of Testing set: (984, 20)


In [19]:
# fit logistic regression model and note the time
facModel = LogisticRegression()
t0 = time.time()
facModel.fit(X_fac, y_train)
t1 = time.time()
print('total fitting time: {}s'.format(round(t1-t0, 3)))

total fitting time: 0.025s


In [20]:
# predicting with the factor analysis model
pred = facModel.predict(X_test_fac)

# accuracy
print('accuracy of factor analysis: {:.2f}'.format(facModel.score(X_test_fac, y_test)))

accuracy of factor analysis: 0.92


In [21]:
# confusion matrix
confusionmatrix = confusion_matrix(y_test, pred)
print(confusionmatrix)

# classification report
print(classification_report(y_test, pred))

[[ 48  78]
 [  0 858]]
              precision    recall  f1-score   support

         ad.       1.00      0.38      0.55       126
      nonad.       0.92      1.00      0.96       858

    accuracy                           0.92       984
   macro avg       0.96      0.69      0.75       984
weighted avg       0.93      0.92      0.90       984
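
The figures in the classification report can be recomputed from the confusion matrix. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with the classes in the order ad., nonad.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[48, 78],
               [0, 858]])

# Recall: correct predictions divided by the number of true examples per class (row sums).
recall = cm.diagonal() / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class (column sums).
precision = cm.diagonal() / cm.sum(axis=0)

# Accuracy: all correct predictions divided by the total number of examples.
accuracy = cm.trace() / cm.sum()

print("recall   :", np.round(recall, 2))
print("precision:", np.round(precision, 2))
print("accuracy :", round(float(accuracy), 2))
```

The recall of 0.38 for ad. means that 78 of the 126 ads in the test set were predicted as nonad., so the 0.92 accuracy is driven largely by the majority class.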

