When objects in a dataset are represented as integers, there are several steps you can take to ensure that the data is properly understood and utilized. Here are some approaches:

1. Understand the Context
Identify the Meaning: Determine what each integer represents. For example, integers might correspond to categories, labels, or ordinal values.
Documentation: Check any accompanying documentation or metadata that explains the mapping of integers to their respective objects.
2. Data Transformation
Categorical Encoding: If the integers represent categories, you might want to convert them into categorical data. This can be done using techniques like:
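Label Encoding: Mapping each distinct category to an integer code (0, 1, 2, ...). If the values are already integers, this only re-indexes them, and the codes still imply an order that nominal categories do not have.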
One-Hot Encoding: Creating binary columns for each category.
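Normalization/Standardization: If the integers represent ordinal data or continuous values, consider normalizing or standardizing them. This mainly helps scale-sensitive models, such as k-nearest neighbors or models trained with gradient descent. Do not scale integers that are only category codes.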
3. Visualization
Histograms: Plot histograms to understand the distribution of the integers.
Bar Charts: Use bar charts if the integers represent categorical data to visualize the frequency of each category.
4. Statistical Analysis
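Descriptive Statistics: Calculate mean, median, mode, standard deviation, etc., to get a sense of the data's central tendency and dispersion. For integers that are only category codes, the mean and standard deviation are not meaningful; use the mode and frequency counts instead.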
Correlation Analysis: If you have multiple features, check for correlations between the integer-represented objects and other features.
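
As a minimal sketch of steps 1, 3, and 4 for nominal codes: the mapping in `category_names` is hypothetical and stands in for whatever the dataset's documentation actually says. The idea is to attach a readable name to each code and then look at frequencies rather than a mean.

```python
import pandas as pd

# Hypothetical code book: the real mapping must come from the dataset's documentation.
category_names = {1: "red", 2: "green", 3: "blue"}

df = pd.DataFrame({"Category": [1, 2, 1, 3, 2, 1]})

# Step 1: attach a readable name to each integer code.
df["Category_Name"] = df["Category"].map(category_names)

# Steps 3 and 4: count how often each category occurs.
counts = df["Category_Name"].value_counts()
print(counts)

# The same counts can be drawn as a bar chart with counts.plot(kind="bar"),
# which needs matplotlib installed.
```

Codes missing from the mapping become NaN after `map`, so a quick `isna().sum()` on the new column is a cheap way to spot undocumented codes.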
Example in Python

Here's a simple example using Python and pandas to handle a dataset where objects are represented as integers:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Sample dataset
data = {'Category': [1, 2, 1, 3, 2, 1]}
df = pd.DataFrame(data)

# Label Encoding
label_encoder = LabelEncoder()
df['Category_Label'] = label_encoder.fit_transform(df['Category'])

# One-Hot Encoding
# scikit-learn 1.2 renamed `sparse` to `sparse_output`; the old name was removed in 1.4.
one_hot_encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = one_hot_encoder.fit_transform(df[['Category']])
df_one_hot = pd.DataFrame(one_hot_encoded, columns=one_hot_encoder.get_feature_names_out(['Category']))

# Combine original dataframe with one-hot encoded dataframe
df_combined = pd.concat([df, df_one_hot], axis=1)

print(df_combined)


This code snippet demonstrates how to perform label encoding and one-hot encoding on a dataset where the 'Category' column is represented by integers.

By following these steps, you can effectively manage and utilize datasets where objects are represented as integers.
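
The normalization and standardization option from step 2 could look roughly like this for an ordinal column. The `Rating` column is a made-up example, and the arithmetic is done directly in pandas rather than with a scikit-learn scaler, which would be an equally reasonable choice.

```python
import pandas as pd

# Hypothetical ordinal column: a 1-to-5 rating stored as integers.
df = pd.DataFrame({"Rating": [1, 2, 5, 1, 4, 2]})

rating = df["Rating"]

# Min-max normalization: rescale to the range [0, 1].
df["Rating_Normalized"] = (rating - rating.min()) / (rating.max() - rating.min())

# Standardization: zero mean and unit variance (population standard deviation).
df["Rating_Standardized"] = (rating - rating.mean()) / rating.std(ddof=0)

print(df)
```

This only makes sense when the distance between integer values carries meaning; applying it to nominal codes would invent an ordering.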

In [1]:
import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns

In [2]:
df = pd.read_csv("Dropout dataset.csv")

In [3]:
df.select_dtypes(include="object")

Unnamed: 0,Target
0,Dropout
1,Graduate
2,Dropout
3,Graduate
4,Graduate
...,...
4419,Graduate
4420,Dropout
4421,Dropout
4422,Graduate


In [4]:
df.select_dtypes(exclude="object")

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 1st sem (without evaluations),Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0,0.000000,0,10.8,1.4,1.74
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,0,6,6,6,13.666667,0,13.9,-0.3,0.79
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,0,6,0,0,0.000000,0,10.8,1.4,1.74
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,0,6,10,5,12.400000,0,9.4,-0.8,-3.12
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,0,6,6,6,13.000000,0,13.9,-0.3,0.79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4419,1,1,6,9773,1,1,125.0,1,1,1,...,0,0,6,8,5,12.666667,0,15.5,2.8,-4.06
4420,1,1,2,9773,1,1,120.0,105,1,1,...,0,0,6,6,2,11.000000,0,11.1,0.6,2.02
4421,1,1,1,9500,1,1,154.0,1,37,37,...,0,0,8,9,1,13.500000,0,13.9,-0.3,0.79
4422,1,1,1,9147,1,1,180.0,1,37,37,...,0,0,5,6,5,12.000000,0,9.4,-0.8,-3.12


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Marital status                                  4424 non-null   int64  
 1   Application mode                                4424 non-null   int64  
 2   Application order                               4424 non-null   int64  
 3   Course                                          4424 non-null   int64  
 4   Daytime/evening attendance	                     4424 non-null   int64  
 5   Previous qualification                          4424 non-null   int64  
 6   Previous qualification (grade)                  4424 non-null   float64
 7   Nacionality                                     4424 non-null   int64  
 8   Mother's qualification                          4424 non-null   int64  
 9   Father's qualification                   

In [6]:
# These integer columns hold category codes, not quantities. Casting them to
# strings makes select_dtypes() and pd.get_dummies() treat them as categorical.
categorical_cols = [
    "Marital status", "Application mode", "Course", "Previous qualification",
    "Nacionality", "Mother's qualification", "Father's qualification",
    "Mother's occupation", "Father's occupation",
]
df[categorical_cols] = df[categorical_cols].astype(str)
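
An alternative to casting the code columns to strings, sketched here on a small made-up frame, is the pandas `category` dtype. It is a swapped-in technique, not what this notebook does; the `toy` frame and its values are only illustrative.

```python
import pandas as pd

# Toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Course": [171, 9254, 9070, 171],
    "Admission grade": [127.3, 142.5, 124.8, 119.6],
})

# Mark the integer code column as categorical without turning it into text.
toy = toy.astype({"Course": "category"})

# select_dtypes can then split the frame on the category dtype.
categorical = toy.select_dtypes(include="category")
numeric = toy.select_dtypes(exclude="category")

print(toy.dtypes)
print(list(toy["Course"].cat.categories))
```

The categories keep their integer values and numeric sort order, and `pd.get_dummies` encodes `category` columns by default just as it does string columns.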

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4424 entries, 0 to 4423
Data columns (total 37 columns):
 #   Column                                          Non-Null Count  Dtype  
---  ------                                          --------------  -----  
 0   Marital status                                  4424 non-null   object 
 1   Application mode                                4424 non-null   object 
 2   Application order                               4424 non-null   int64  
 3   Course                                          4424 non-null   object 
 4   Daytime/evening attendance	                     4424 non-null   int64  
 5   Previous qualification                          4424 non-null   object 
 6   Previous qualification (grade)                  4424 non-null   float64
 7   Nacionality                                     4424 non-null   object 
 8   Mother's qualification                          4424 non-null   object 
 9   Father's qualification                   

In [8]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Application order,4424.0,1.727848,1.313793,0.0,1.0,1.0,2.0,9.0
Daytime/evening attendance\t,4424.0,0.890823,0.311897,0.0,1.0,1.0,1.0,1.0
Previous qualification (grade),4424.0,132.613314,13.188332,95.0,125.0,133.1,140.0,190.0
Admission grade,4424.0,126.978119,14.482001,95.0,117.9,126.1,134.8,190.0
Displaced,4424.0,0.548373,0.497711,0.0,0.0,1.0,1.0,1.0
Educational special needs,4424.0,0.011528,0.10676,0.0,0.0,0.0,0.0,1.0
Debtor,4424.0,0.113698,0.31748,0.0,0.0,0.0,0.0,1.0
Tuition fees up to date,4424.0,0.880651,0.324235,0.0,1.0,1.0,1.0,1.0
Gender,4424.0,0.351718,0.47756,0.0,0.0,0.0,1.0,1.0
Scholarship holder,4424.0,0.248418,0.432144,0.0,0.0,0.0,0.0,1.0
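
Note that `describe()` only summarizes numeric columns by default, so the columns cast to strings no longer appear in the table above. A minimal sketch on a made-up frame shows how a string-coded column can still be summarized on its own:

```python
import pandas as pd

# Toy frame: "Course" holds string codes, as after astype(str).
toy = pd.DataFrame({
    "Course": ["171", "9254", "9070", "171"],
    "Admission grade": [127.3, 142.5, 124.8, 119.6],
})

# Default describe(): numeric columns only.
numeric_summary = toy.describe().transpose()

# Describing the string column directly gives count, unique, top, and freq.
course_summary = toy["Course"].describe()

print(numeric_summary)
print(course_summary)
```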


In [9]:
#df_nums = df.select_dtypes(exclude='object')
#df_objs = df.select_dtypes(include='object')

In [10]:
#df_objs = pd.get_dummies(df_objs,drop_first=True)

In [11]:
#final_df = pd.concat([df_nums,df_objs],axis=1)

In [12]:
#final_df

In [13]:
df.select_dtypes(include='object')

Unnamed: 0,Marital status,Application mode,Course,Previous qualification,Nacionality,Mother's qualification,Father's qualification,Mother's occupation,Father's occupation,Target
0,1,17,171,1,1,19,12,5,9,Dropout
1,1,15,9254,1,1,1,3,3,3,Graduate
2,1,1,9070,1,1,37,37,9,9,Dropout
3,1,17,9773,1,1,38,37,5,3,Graduate
4,2,39,8014,1,1,37,38,9,9,Graduate
...,...,...,...,...,...,...,...,...,...,...
4419,1,1,9773,1,1,1,1,5,4,Graduate
4420,1,1,9773,1,105,1,1,9,9,Dropout
4421,1,1,9500,1,1,37,37,9,9,Dropout
4422,1,1,9147,1,1,37,37,7,4,Graduate


In [14]:
df["Father's qualification"]

0       12
1        3
2       37
3       37
4       38
        ..
4419     1
4420     1
4421    37
4422    37
4423    37
Name: Father's qualification, Length: 4424, dtype: object

In [15]:
df 

Unnamed: 0,Marital status,Application mode,Application order,Course,Daytime/evening attendance\t,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.000000,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.000000,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.400000,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.000000,0,13.9,-0.3,0.79,Graduate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4419,1,1,6,9773,1,1,125.0,1,1,1,...,0,6,8,5,12.666667,0,15.5,2.8,-4.06,Graduate
4420,1,1,2,9773,1,1,120.0,105,1,1,...,0,6,6,2,11.000000,0,11.1,0.6,2.02,Dropout
4421,1,1,1,9500,1,1,154.0,1,37,37,...,0,8,9,1,13.500000,0,13.9,-0.3,0.79,Dropout
4422,1,1,1,9147,1,1,180.0,1,37,37,...,0,5,6,5,12.000000,0,9.4,-0.8,-3.12,Graduate


In [16]:
df_nums = df.select_dtypes(exclude='object')
df_objs = df.select_dtypes(include='object')

In [17]:
df_objs = df_objs.drop("Target", axis=1)

In [18]:
df_output = df["Target"]

In [19]:
df_objs = pd.get_dummies(df_objs,drop_first=True)
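
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up column using a few "Application mode" codes. `dtype=int` is passed explicitly because pandas 2.0 and later return boolean dummy columns by default, while older releases return 0/1 integers.

```python
import pandas as pd

# Toy frame with string codes, as produced by astype(str).
toy = pd.DataFrame({"Application mode": ["17", "15", "1", "17", "39"]})

dummies = pd.get_dummies(
    toy,
    columns=["Application mode"],  # encode this column explicitly
    drop_first=True,               # drop the first level to avoid a redundant column
    dtype=int,                     # force 0/1 integers regardless of pandas version
)

print(dummies)
```

The levels are sorted as strings, so the dropped baseline here is "1". A row of all zeros therefore means "Application mode 1", not a missing value.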

In [20]:
final_df = pd.concat([df_nums,df_objs,df_output],axis=1)

In [21]:
final_df

Unnamed: 0,Application order,Daytime/evening attendance\t,Previous qualification (grade),Admission grade,Displaced,Educational special needs,Debtor,Tuition fees up to date,Gender,Scholarship holder,...,Father's occupation_3,Father's occupation_4,Father's occupation_5,Father's occupation_6,Father's occupation_7,Father's occupation_8,Father's occupation_9,Father's occupation_90,Father's occupation_99,Target
0,5,1,122.0,127.3,1,0,0,1,1,0,...,0,0,0,0,0,0,1,0,0,Dropout
1,1,1,160.0,142.5,1,0,0,0,1,0,...,1,0,0,0,0,0,0,0,0,Graduate
2,5,1,122.0,124.8,1,0,0,0,1,0,...,0,0,0,0,0,0,1,0,0,Dropout
3,2,1,122.0,119.6,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,Graduate
4,1,0,100.0,141.5,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,Graduate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4419,6,1,125.0,122.2,0,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,Graduate
4420,2,1,120.0,119.0,1,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,Dropout
4421,1,1,154.0,149.5,1,0,0,1,0,1,...,0,0,0,0,0,0,1,0,0,Dropout
4422,1,1,180.0,153.8,1,0,0,1,0,1,...,0,1,0,0,0,0,0,0,0,Graduate


In [22]:
final_df.to_csv("dropout_dataset_final_with_target.csv")
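
One thing to be aware of when saving: `to_csv` writes the row index as an extra, unnamed first column unless `index=False` is passed, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below uses a made-up two-row frame and in-memory buffers instead of real files.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Admission grade": [127.3, 142.5],
    "Target": ["Dropout", "Graduate"],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Whether to keep the index depends on whether it carries information; for a default `RangeIndex` like the one here, dropping it is usually the simpler choice.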