# Handling Missing Values with Pandas - Class Activity

## Instructions given by professor

The tasks are the following:
- Load Data
- Delete features with more than 70% missing values
- Delete observations with more than 50% missing values
- Check for duplicated rows
- Delete features with average equal to Zero
- Identify Type of data of each feature
- Validate that Numbers are numeric values
- Identify Categorical Features --> Generate the corresponding Dummy columns

* Finally, you should execute the same code, now loading the dataset called "test.csv"

The Jupyter Notebook (.ipynb) should be submitted together with an export of the code, as HTML or PDF, with all code executed and outputs visible.

Be sure to alternate code and text cells, with the text cells explaining the code and results obtained.

## Development

### Load Data
In this part, we import all the libraries that we are going to need and load the dataframe that we are going to analyze.

In [27]:
import pandas as pd
df = pd.read_csv('../DataBases/ModalidadVirtual.csv',delimiter=',')
print(df.info()) # ".info()" prints the summary of the dataframe itself and returns None, which is why "None" appears after the output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 0 to 224
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  225 non-null    int64 
 1   time        225 non-null    object
 2   carrera     225 non-null    object
 3   acepta      225 non-null    object
 4   positivo    225 non-null    object
 5   negativo    225 non-null    object
 6   edad        225 non-null    int64 
 7   sexo        225 non-null    object
 8   trabajo     225 non-null    object
dtypes: int64(2), object(7)
memory usage: 15.9+ KB
None


### Delete features with more than 70% missing values
To remove features/columns with more than 70% missing values, the following procedure was followed:

1. Declare a list of the dataframe's column names.
2. Iterate over that list in a for loop, storing each column of the dataframe in a variable.
3. Create a variable to store the number of data points in the column.
4. Create a variable to store the number of missing values in the column.
5. Calculate the percentage of missing values.
6. If the percentage of missing values is greater than 70%, remove the feature from the dataframe.

In [28]:
headers_list = df.columns.tolist()
for header in headers_list:
    feature = df[header]
    len_feature = len(feature)
    num_missing_features = feature.isnull().sum()
    percentage_missing_features = (num_missing_features * 100) / len_feature
    # Strictly greater than 70, to match "more than 70% missing values"
    if percentage_missing_features > 70:
        df = df.drop(header, axis=1)
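
The same rule can also be written without an explicit loop. The following is only a minimal sketch on a made-up dataframe (`toy` and its columns are hypothetical, not part of the class dataset); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is 75% missing, column "c" is 25% missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [np.nan, "x", "y", "z"],
})

# isna().mean() gives the fraction of missing values per column.
missing_share = toy.isna().mean()

# Keep only the columns whose missing share does not exceed 70%.
toy_reduced = toy.loc[:, missing_share <= 0.70]
print(toy_reduced.columns.tolist())
```

Here only column "b" exceeds the threshold, so it is the only one removed.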

### Delete observations with more than 50% missing values
To remove observations/rows with more than 50% missing values, the following procedure was followed:

1. Declare a variable with the number of observations in the dataframe.
2. Iterate in a for loop, where the variable 'row' is the index that identifies each observation.
3. Create a variable to store each observation.
4. Declare a variable with the number of data points in that observation and another variable with the number of missing data points in it.
5. Create a variable that stores the percentage of missing values in the observation.
6. If the percentage of missing values in the observation is greater than 50%, that observation is deleted from the dataframe.

In [29]:
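len_df = len(df)
# Assumes the default index 0..len_df-1 created by read_csv, so "row" is also the row label
for row in range(len_df):
    observation = df.loc[row]
    len_observation = len(observation)
    num_missing_observations = observation.isnull().sum()
    percentage_missing_observations = (num_missing_observations * 100) / len_observation
    # Strictly greater than 50, to match "more than 50% missing values"
    if percentage_missing_observations > 50:
        df = df.drop([row], axis=0)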
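
A loop-free variant of the same idea is sketched below, assuming a small made-up dataframe (`toy` is hypothetical). Because it filters with a boolean mask instead of looking rows up by label, it does not depend on the index being 0, 1, 2, and so on.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a non-default index.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, np.nan, 5.0],
    "c": [3.0, 4.0, np.nan],
    "d": [1.0, 1.0, np.nan],
}, index=[10, 20, 30])

# Fraction of missing values per row (axis=1 averages across columns).
row_missing_share = toy.isna().mean(axis=1)

# Keep rows with at most 50% missing values.
toy_reduced = toy[row_missing_share <= 0.50]
print(toy_reduced.index.tolist())
```

Row 20 is exactly 50% missing and is kept, since the task asks to delete rows with more than 50% missing values; row 30 is 75% missing and is removed.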

### Check for duplicated rows
To remove duplicate rows:

1. Create a list to store the positions of the duplicate rows and a counter variable that tracks the position of the current row.
2. In a for loop, iterate over the result of ".duplicated()", which returns 'True' for every row that repeats an earlier one.
3. If the row is repeated, add its position to the list of duplicate rows.
4. Remove from the dataframe all rows whose positions are in the previously created list.

In [30]:
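duplicated_row_list = []
iteration_index = 0
for row in df.duplicated():
    if row: duplicated_row_list.append(iteration_index)
    iteration_index += 1
# The list holds positions, so translate them into index labels before dropping;
# labels and positions can differ once rows have been removed in the previous step
df = df.drop(df.index[duplicated_row_list], axis=0)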
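
pandas also provides `DataFrame.drop_duplicates()`, which performs the check and the removal in one call. The sketch below uses a made-up dataframe (`toy` is hypothetical) to show both the boolean mask and the cleaned result.

```python
import pandas as pd

# Hypothetical toy data: the row labelled 7 repeats the row labelled 5.
toy = pd.DataFrame(
    {"a": [1, 2, 1, 3], "b": ["x", "y", "x", "z"]},
    index=[5, 6, 7, 8],
)

# duplicated() marks every row that repeats an earlier one.
dup_mask = toy.duplicated()
print(dup_mask.sum(), "duplicated row(s)")

# drop_duplicates() keeps the first occurrence of each row by default.
toy_unique = toy.drop_duplicates()
print(toy_unique.index.tolist())
```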

### Delete features with Average = 0
A for loop iterates over the list of dataframe headers, and the process for each iteration is as follows:

1. Create an empty list of possible (distinct) values for the feature.
2. Create a variable to store the feature to be analyzed.
3. In a nested for loop, iterate over the feature to access each of its observations.
4. If the value of the observation is not yet in the list of possible values, add it to the list.
5. If the list of possible values is as long as the feature, every value is unique, so the feature is skipped and the loop moves on to the next one.
6. If there are fewer possible values than rows in the dataframe, create a variable to accumulate the average of the feature.
7. Iterate over the list of possible values and sum the number of times each value appears in the feature.
8. Divide that sum by the length of the list of possible values to obtain the average.
9. If the average is equal to 0, remove the feature from the dataframe.

Note that the quantity computed here is the average number of occurrences per distinct value, not the arithmetic mean of the feature's values. For a feature without missing values it equals the number of rows divided by the number of distinct values, so it is never 0 and this check does not remove any feature.

In [31]:
headers_list = df.columns.tolist()
for header in headers_list:
    # Collect the distinct values of the feature
    possible_value_list = []
    feature = df[header]
    for observation in feature:
        if observation not in possible_value_list: possible_value_list.append(observation)
    # Every value is unique: skip this feature
    if len(possible_value_list) == len(feature): continue
    # Average number of occurrences per distinct value
    average = 0
    for value in possible_value_list: average += (df[header] == value).sum()
    average /= len(possible_value_list)
    if average == 0: df = df.drop(header, axis=1)
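
If "average equal to zero" is read as the arithmetic mean of a numeric column (an assumption about the intent of the task, not something the instructions spell out), a possible sketch is the following. The dataframe `toy` and the columns `zero_mean` and `all_zero` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns have mean 0, one does not.
toy = pd.DataFrame({
    "zero_mean": [-1.0, 0.0, 1.0],
    "all_zero": [0, 0, 0],
    "edad": [20, 25, 30],
    "sexo": ["F", "M", "F"],
})

# numeric_only=True skips text columns, for which a mean is undefined.
means = toy.mean(numeric_only=True)

# Names of the numeric columns whose mean is exactly 0.
zero_mean_cols = means[means == 0].index.tolist()

toy_reduced = toy.drop(columns=zero_mean_cols)
print(toy_reduced.columns.tolist())
```

For real floating-point data, an exact comparison with 0 can be too strict; a tolerance-based check such as `numpy.isclose` may be more appropriate.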

### Identify type of data of each feature
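To identify the data type of each feature, a for loop is used to iterate over the list of headers and print the information of each feature separately, which makes the analysis easier. Since ".info()" prints its summary directly and returns None, the word "None" appears after each block of the output.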

In [32]:
for header in headers_list:
    print(f'{df[header].info()}\n\n{"*"*50}')

<class 'pandas.core.series.Series'>
RangeIndex: 222 entries, 0 to 221
Series name: Unnamed: 0
Non-Null Count  Dtype
--------------  -----
222 non-null    int64
dtypes: int64(1)
memory usage: 1.9 KB
None

**************************************************
<class 'pandas.core.series.Series'>
RangeIndex: 222 entries, 0 to 221
Series name: time
Non-Null Count  Dtype 
--------------  ----- 
222 non-null    object
dtypes: object(1)
memory usage: 1.9+ KB
None

**************************************************
<class 'pandas.core.series.Series'>
RangeIndex: 222 entries, 0 to 221
Series name: carrera
Non-Null Count  Dtype 
--------------  ----- 
222 non-null    object
dtypes: object(1)
memory usage: 1.9+ KB
None

**************************************************
<class 'pandas.core.series.Series'>
RangeIndex: 222 entries, 0 to 221
Series name: acepta
Non-Null Count  Dtype 
--------------  ----- 
222 non-null    object
dtypes: object(1)
memory usage: 1.9+ KB
None

*****************************
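
A more compact view of the same information is the `dtypes` attribute, which lists the type of every column at once. The sketch below uses a made-up dataframe (`toy` and the column `nota` are hypothetical) and also shows `select_dtypes`, which filters columns by type.

```python
import pandas as pd

# Hypothetical toy data with integer, text and float columns.
toy = pd.DataFrame({
    "edad": [20, 25],
    "sexo": ["F", "M"],
    "nota": [4.5, 3.0],
})

# One line per column with its data type.
print(toy.dtypes)

# Names of the columns that pandas considers numeric.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```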

### Validate that numbers are numeric Values
The process of validating the numeric data of the features is as follows:

1. The list of dataframe headers is iterated over in a for loop.
2. Three variables are created to keep track of the number of integers, floats, and values of any other type.
3. The current feature of the dataframe is stored in a variable.
4. The feature is iterated over to access each observation, and the matching counter (integers, floats, or other types) is updated according to the data type of the observation.
5. If the sum of integers and floats is not greater than the count of other types, we infer that the feature is not meant to be numeric, so we skip it and continue with the next one.
6. Otherwise, the whole feature is cast to whichever numeric type is more frequent, integer or float, in order to unify its data type.

In [33]:
headers_list = df.columns.tolist()
for header in headers_list:
    int_count, float_count, other_count = 0, 0, 0
    feature = df[header]
    # Count how many observations are integers, floats, or something else
    for observation in feature:
        if type(observation) == int:
            int_count += 1
        elif type(observation) == float:
            float_count += 1
        else: other_count += 1
    # Mostly non-numeric values: leave the feature as it is
    if int_count + float_count <= other_count: continue
    # Unify the feature to the most frequent numeric type
    if int_count > float_count:
        df[header] = df[header].astype('int64')
    if float_count > int_count:
        df[header] = df[header].astype('float')
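
The counting approach above only recognizes values that are already stored as numbers. When numbers arrive as text (for example "20"), `pd.to_numeric` with `errors="coerce"` can attempt the conversion and turn anything unparseable into NaN. The following is a minimal sketch on made-up data (`toy` is hypothetical); the "more than half of the values parse" rule mirrors the majority idea of the steps above and is an assumption, not a pandas convention.

```python
import pandas as pd

# Hypothetical toy data: "edad" holds numbers stored as text plus one bad entry.
toy = pd.DataFrame(
    {"edad": ["20", "25", "n/a"], "carrera": ["A", "B", "C"]},
    dtype=object,
)

for col in toy.columns:
    # Unparseable entries become NaN instead of raising an error.
    converted = pd.to_numeric(toy[col], errors="coerce")
    # Treat the column as numeric only if most of its values could be parsed.
    if converted.notna().mean() > 0.5:
        toy[col] = converted

print(toy.dtypes)
```

After the loop, `edad` is numeric (with a missing value where "n/a" was) and `carrera` is left untouched.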

### Identify categorical features → Generate dummies
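To generate the dummy columns, the "pd.get_dummies()" function provided by pandas is used. The parameters needed are the dataframe and the list of columns to encode. In the cell below, the variable 'header' still holds the last column name from the previous loop, so only that feature is converted into dummy columns.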

In [34]:
print(f'{pd.get_dummies(df,columns=[header])}')

     Unnamed: 0        time                       carrera acepta  \
0             0  2020-11-08        Ingeniería de Sistemas     Si   
1             1  2020-11-08                    Psicología     Si   
2             2  2020-11-08        Ingeniería de Sistemas     Si   
3             3  2020-11-08        Ingeniería de Sistemas     Si   
4             4  2020-11-08        Ingeniería de Sistemas     Si   
..          ...         ...                           ...    ...   
217         230  2020-12-10  Gestión Turística y Hotelera     Si   
218         231  2020-12-11        Ingeniería de Sistemas     No   
219         232  2020-12-11  Gestión Turística y Hotelera     No   
220         233  2020-12-11         Ingeniería Agronómica     Si   
221         234  2020-12-12        Comercio Internacional     Si   

                          positivo  \
0                Horario flexible.   
1    Acceso desde cualquier lugar.   
2                Horario flexible.   
3                Horario flexib
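
To encode several categorical features at once, the candidate columns can be selected by type and passed together to `pd.get_dummies`. This is a minimal sketch on made-up data (`toy` is hypothetical); treating every non-numeric column as categorical is an assumption, and free-text or date-like columns such as `positivo` or `time` may need to be excluded by hand.

```python
import pandas as pd

# Hypothetical toy data with one numeric and two categorical columns.
toy = pd.DataFrame({
    "edad": [20, 25, 30],
    "sexo": ["F", "M", "F"],
    "acepta": ["Si", "No", "Si"],
})

# Assumption: every non-numeric column is a categorical feature.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One indicator column is created per category of each selected feature.
toy_dummies = pd.get_dummies(toy, columns=categorical_cols)
print(toy_dummies.columns.tolist())
```

The numeric column is kept as it is, and each categorical column is replaced by indicator columns named after the feature and the category, such as `sexo_F` and `sexo_M`.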

## Made By
- Diego Monroy Minero
- Sergio Johanan Barrera Chan
- Juan Antonio Cel Vazquez
- Ariel Joel Buenfil Góngora