## Email Dataset Exploration

### Overview:
In this phase, we explore a variety of publicly available email datasets to identify the one most suitable for building a reliable spam classification model. Each dataset will be evaluated based on the quality and relevance of the features it provides, such as sender/receiver details, subject lines, email content, and spam labels.

### Key Steps:
- **Dataset Selection**: Review multiple datasets to assess completeness, size, and diversity of email data.
- **Feature Analysis**: Examine the available features (sender info, subject, body text) for their potential contribution to spam classification.
- **Spam Label Quality**: Ensure the dataset contains accurate and well-defined spam labels for training and evaluation purposes.

Once a suitable dataset is identified, it will be preprocessed and prepared for model training.
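
The evaluation criteria above can be sketched as a small helper that reports size, missing data, and whether text and label columns are present. This is only an illustrative sketch: the function name `summarize_candidate`, the default column names (`Subject`, `Email Text`, `spam`), and the toy data are assumptions made for the example, not part of the notebook's workflow.

```python
import pandas as pd

def summarize_candidate(df, text_cols=("Subject", "Email Text"), label_col="spam"):
    """Describe how usable a DataFrame looks for spam classification (hypothetical helper)."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        # Share of cells that are missing across the whole frame
        "missing_ratio": float(df.isna().to_numpy().mean()),
        # True only if every expected text column is present
        "has_text": all(col in df.columns for col in text_cols),
        "has_label": label_col in df.columns,
    }

# Tiny made-up frame, for illustration only
toy = pd.DataFrame({
    "Subject": ["Win a prize", "Meeting notes", None],
    "Email Text": ["Click here", "See attached", "Hello"],
    "spam": [1, 0, 0],
})

summary = summarize_candidate(toy)
print(summary)
```

Calling the helper on each loaded frame would give a quick, comparable view of which candidates have both text and labels.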

### Importing Required Libraries

In [1]:
import pandas as pd

## Enron Datasets

## Dataset 1: Enron Datasets data.csv

### Loading and Printing the Dataset

In [12]:
df_1 = pd.read_csv(r"Dataset\Extra Dataset\Enron Datasets\data.csv")

print("Dataset is:\n")
print(df_1.head())

Dataset is:

      Name  Emails  Year
0    Sally    3000  1999
1    Sally    4000  2001
2  Richard    2000  2000
3  Richard    6000  2001
4   Mathew    5000  1999


### Seeing the Shape of the Dataset

In [7]:
print("Shape of the Dataset is:\n")

print(df_1.shape)

Shape of the Dataset is:

(14, 3)


### Seeing the Columns of the Dataset

In [5]:
print("Columns of the Dataset are:\n")

print(df_1.columns)

Columns of the Dataset are:

Index(['Name', 'Emails', 'Year'], dtype='object')


### Seeing the Information of the Dataset

In [6]:
print("Information of the Dataset is:\n")

print(df_1.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    14 non-null     object
 1   Emails  14 non-null     int64 
 2   Year    14 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 468.0+ bytes
None


## Dataset 2: Enron Datasets enron_org.csv

### Loading and Printing the Dataset

In [13]:
df_2 = pd.read_csv(r"Dataset\Extra Dataset\Enron Datasets\enron_org.csv")

print("Dataset is:\n")
print(df_2.head())

Dataset is:

                  Email_id    status
0    marie.heard@enron.com       NaN
1  mark.e.taylor@enron.com  Employee
2   lindy.donoho@enron.com  Employee
3      lisa.gang@enron.com       NaN
4  jeff.skilling@enron.com       CEO


### Seeing the Shape of the Dataset

In [9]:
print("Shape of the Dataset is:\n")

print(df_2.shape)

Shape of the Dataset is:

(149, 2)


### Seeing the Columns of the Dataset

In [10]:
print("Columns of the Dataset are:\n")

print(df_2.columns)

Columns of the Dataset are:

Index(['Email_id', 'status'], dtype='object')


### Seeing the Information of the Dataset

In [11]:
print("Information of the Dataset is:\n")

print(df_2.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Email_id  149 non-null    object
 1   status    117 non-null    object
dtypes: object(2)
memory usage: 2.5+ KB
None


## Dataset 3: Enron Datasets enron_sentiments.csv

### Loading and Printing the Dataset

In [19]:
df_3 = pd.read_csv(r"Dataset\Extra Dataset\Enron Datasets\enron_sentiments.csv", sep='\t')

print("Dataset is:\n")

print(df_3.head())

Dataset is:

         date  close   high    low  Sentiments               Employee
0  01/10/2001  29.15  29.45  27.05    1.527778  kenneth.lay@enron.com
1  02/10/2001  30.61  30.77  28.70    1.546296  kenneth.lay@enron.com
2  03/10/2001  33.49  34.44  31.20    2.291667  kenneth.lay@enron.com
3  04/10/2001  33.10  34.74  32.75    1.388889  kenneth.lay@enron.com
4  05/10/2001  31.73  34.49  31.58    1.562500  kenneth.lay@enron.com


### Seeing the Shape of the Dataset

In [20]:
print("Shape of the Dataset is:\n")

print(df_3.shape)

Shape of the Dataset is:

(6484, 6)


### Seeing the Columns of the Dataset

In [21]:
print("Columns of the Dataset are:\n")

print(df_3.columns)

Columns of the Dataset are:

Index(['date', 'close', 'high', 'low', 'Sentiments', 'Employee'], dtype='object')


### Seeing the Information of the Dataset

In [22]:
print("Information of the Dataset is:\n")

print(df_3.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6484 entries, 0 to 6483
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        479 non-null    object 
 1   close       384 non-null    float64
 2   high        384 non-null    float64
 3   low         384 non-null    float64
 4   Sentiments  398 non-null    float64
 5   Employee    479 non-null    object 
dtypes: float64(4), object(2)
memory usage: 304.1+ KB
None
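
The `info()` output above reports 6484 rows but at most 479 non-null values in any column, which suggests that many rows may be entirely empty. A minimal sketch of how such rows could be counted and dropped is shown below; the toy frame is made up for illustration and only mimics the column names seen above.

```python
import numpy as np
import pandas as pd

# Toy stand-in with two fully empty rows (illustrative data only)
toy = pd.DataFrame({
    "date": ["01/10/2001", None, "02/10/2001", None],
    "close": [29.15, np.nan, np.nan, np.nan],
    "Employee": ["kenneth.lay@enron.com", None, "kenneth.lay@enron.com", None],
})

# Rows where every column is missing carry no information
empty_rows = toy.isna().all(axis=1)
print("Fully empty rows:", int(empty_rows.sum()))

# Keep any row that has at least one non-null value
trimmed = toy.dropna(how="all").reset_index(drop=True)
print(trimmed.shape)
```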


## Dataset 4: Enron Datasets enronClean.csv

### Loading and Printing the Dataset

In [24]:
df_4 = pd.read_csv(r"Dataset\Extra Dataset\Enron Datasets\enronClean.csv")

print("Dataset is:\n")

print(df_4.head())

Dataset is:

               TID               MID             SUBJECT  \
0  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   
1  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   
2  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   
3  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   
4  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   

                         FROM                   TIMESTAMP  \
0   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   
1   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   
2   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   
3   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   
4   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   

                          TO TYPE  
0       john.hodge@enron.com   TO  
1      john.singer@enron.com   TO  
2       scott.neal@enron.com   TO  
3  clarissa.garcia@enron.com   TO  
4    chris.germany@enron.com   TO  


### Seeing the Shape of the Dataset

In [25]:
print("Shape of the Dataset is:\n")

print(df_4.shape)

Shape of the Dataset is:

(526390, 7)


### Seeing the Columns of the Dataset

In [26]:
print("Columns of the Dataset are:\n")

print(df_4.columns)

Columns of the Dataset are:

Index(['TID', 'MID', 'SUBJECT', 'FROM', 'TIMESTAMP', 'TO', 'TYPE'], dtype='object')


### Seeing the Information of the Dataset

In [27]:
print("Information of the Dataset is:\n")

print(df_4.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526390 entries, 0 to 526389
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   TID        526390 non-null  object
 1   MID        526390 non-null  object
 2   SUBJECT    526390 non-null  object
 3   FROM       526390 non-null  object
 4   TIMESTAMP  526390 non-null  object
 5   TO         526388 non-null  object
 6   TYPE       526390 non-null  object
dtypes: object(7)
memory usage: 28.1+ MB
None
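
The first rows printed above repeat the same `MID` with a different `TO` address on each row, so the file appears to hold one row per recipient rather than one row per message. If a per-message view were needed, the recipients could be collected into a list, as in this sketch on made-up data (the toy values are assumptions, not taken from the file):

```python
import pandas as pd

# Toy data in the same long format: one row per (message, recipient)
toy = pd.DataFrame({
    "MID": ["m1", "m1", "m2"],
    "FROM": ["a@enron.com", "a@enron.com", "b@enron.com"],
    "TO": ["x@enron.com", "y@enron.com", "z@enron.com"],
})

# Collapse to one row per message, gathering recipients into a list
per_message = (
    toy.groupby(["MID", "FROM"], sort=False)["TO"]
       .agg(list)
       .reset_index()
)
print(per_message)
```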


## Dataset 5: Enron Datasets EnronEmployeeInformation.csv

### Loading and Printing the Dataset

In [35]:
df_5 = pd.read_csv(r"Dataset\Extra Dataset\Enron Datasets\EnronEmployeeInformation.csv", index_col = None)

print("Dataset is:\n")

print(df_5.head())

Dataset is:

   Unnamed: 0  ID           Name  EmailID    Folder         Department  \
0           1   1    John Arnold  jarnold  arnold-j  ENA Gas Financial   
1           2   2    Harry Arora   harora   arora-h     ENA East Power   
2           3   3  Robert Badeer  rbadeer  badeer-r     ENA West Power   
3           4   4   Susan Bailey  sbaile2  bailey-s          ENA Legal   
4           5   5      Eric Bass    ebass    bass-e      ENA Gas Texas   

              Title  
0        VP Trading  
1        VP Trading  
2       Mgr Trading  
3  Specialist Legal  
4            Trader  


### Seeing the Shape of the Dataset

In [34]:
print("Shape of the Dataset is:\n")

print(df_5.shape)

Shape of the Dataset is:

(156, 7)


### Seeing the Columns of the Dataset

In [30]:
print("Columns of the Dataset are:\n")

print(df_5.columns)

Columns of the Dataset are:

Index(['Unnamed: 0', 'ID', 'Name', 'EmailID', 'Folder', 'Department', 'Title'], dtype='object')


### Seeing the Information of the Dataset

In [36]:
print("Information of the Dataset is:\n")

print(df_5.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  156 non-null    int64 
 1   ID          156 non-null    int64 
 2   Name        156 non-null    object
 3   EmailID     155 non-null    object
 4   Folder      156 non-null    object
 5   Department  153 non-null    object
 6   Title       156 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.7+ KB
None


## Dataset 6: Enron Datasets enronThread2001.csv

### Loading and Printing the Dataset

In [37]:
df_6 = pd.read_csv(r"Dataset\Extra Dataset\Enron Datasets\enronThread2001.csv")

print("Dataset is:\n")

print(df_6.head())

Dataset is:

               TID               MID             SUBJECT  \
0  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   
1  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   
2  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   
3  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   
4  e94a22508dac953   e94a22508dac953   "FW: LINE SM-123"   

                         FROM                   TIMESTAMP  \
0   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   
1   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   
2   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   
3   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   
4   victor.lamadrid@enron.com   2001-10-01T14:19:03-07:00   

                          TO TYPE  
0       john.hodge@enron.com   TO  
1      john.singer@enron.com   TO  
2       scott.neal@enron.com   TO  
3  clarissa.garcia@enron.com   TO  
4    chris.germany@enron.com   TO  


### Seeing the Shape of the Dataset

In [38]:
print("Shape of the Dataset is:\n")

print(df_6.shape)

Shape of the Dataset is:

(526390, 7)


### Seeing the Columns of the Dataset

In [39]:
print("Columns of the Dataset are:\n")

print(df_6.columns)

Columns of the Dataset are:

Index(['TID', 'MID', 'SUBJECT', 'FROM', 'TIMESTAMP', 'TO', 'TYPE'], dtype='object')


### Seeing the Information of the Dataset

In [40]:
print("Information of the Dataset is:\n")

print(df_6.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 526390 entries, 0 to 526389
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   TID        526390 non-null  object
 1   MID        526390 non-null  object
 2   SUBJECT    526390 non-null  object
 3   FROM       526390 non-null  object
 4   TIMESTAMP  526390 non-null  object
 5   TO         526388 non-null  object
 6   TYPE       526390 non-null  object
dtypes: object(7)
memory usage: 28.1+ MB
None
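
The shape, columns, and first rows shown here match those of Dataset 4 (`enronClean.csv`), so the two files may be duplicates. One way to check would be `DataFrame.equals`; the sketch below uses small made-up frames rather than `df_4` and `df_6` so that it runs on its own.

```python
import pandas as pd

a = pd.DataFrame({"MID": ["m1", "m1"], "TO": ["x@enron.com", "y@enron.com"]})
b = a.copy()                                         # exact copy of a
c = a.assign(TO=["x@enron.com", "z@enron.com"])      # differs in one cell

# DataFrame.equals is True only when shape, dtypes and all values match
same_ab = a.equals(b)
same_ac = a.equals(c)
print(same_ab, same_ac)
```

In the notebook, the analogous call would be `df_4.equals(df_6)`; its result is not known from the outputs shown here.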


## Dataset 7: Enron Datasets FinalAdjacencyMatrix.csv

### Loading and Printing the Dataset

In [41]:
df_7 = pd.read_csv(r"Dataset\Extra Dataset\Enron Datasets\FinalAdjacencyMatrix.csv")

print("Dataset is:\n")

print(df_7.head())

Dataset is:

               name  Phillip K. Allen  John Arnold  Harry Arora  \
0  Phillip K. Allen               0.0          0.0            0   
1       John Arnold               1.0         42.0            0   
2       Harry Arora               0.0          0.0            2   
3     Robert Badeer               0.0          0.0            0   
4      Susan Bailey               0.0          0.0            0   

   Robert Badeer  Susan Bailey  Eric Bass  Don Baughman Jr.  Sally Beck  \
0            0.0           0.0        0.0                 0         0.0   
1            0.0           0.0        0.0                 0         0.0   
2            0.0           0.0        0.0                 0         0.0   
3            0.0           0.0        0.0                 0         0.0   
4            0.0           0.0        0.0                 0         0.0   

   Robert Benson  ...  V Charles Weldon  Greg Whalley  Stacey W. White  \
0              0  ...               0.0           3.0      

### Seeing the Shape of the Dataset

In [44]:
print("Shape of the Dataset is:\n")

print(df_7.shape)

Shape of the Dataset is:

(156, 157)


### Seeing the Columns of the Dataset

In [43]:
print("Columns of the Dataset are:\n")

print(df_7.columns)

Columns of the Dataset are:

Index(['name', 'Phillip K. Allen', 'John Arnold', 'Harry Arora',
       'Robert Badeer', 'Susan Bailey', 'Eric Bass', 'Don Baughman Jr.',
       'Sally Beck', 'Robert Benson',
       ...
       'V Charles Weldon', 'Greg Whalley', 'Stacey W. White', 'Mark Whitt',
       'Jason Williams', 'Bill Williams III', 'Jason Wolfe', 'Paul Y'Barbo',
       'Andy Zipper', 'John Zufferli'],
      dtype='object', length=157)


### Seeing the Information of the Dataset

In [42]:
print("Information of the Dataset is:\n")

print(df_7.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Columns: 157 entries, name to John Zufferli
dtypes: float64(125), int64(31), object(1)
memory usage: 191.5+ KB
None


## Dataset 8: Enron Datasets Sentiments_employees.csv

### Loading and Printing the Dataset

In [50]:
# Step 1: Load the CSV without an explicit separator; this tab-separated file comes back as a single column
df_8 = pd.read_csv(r"Dataset\Extra Dataset\Enron Datasets\Sentiments_employees.csv", delimiter=None, engine='python')

# Step 2: Print the dataset and check how it was loaded
print("Raw Dataset:\n")
print(df_8.head())

# Step 3: Split the single column on tabs (this drops the header names and leaves every value as a string)
df_8 = df_8['date\tclose\thigh\tlow\tSentiments\tEmployee'].str.split('\t', expand=True)

# Step 4: Print the cleaned dataset
print("\nCleaned Dataset:\n")

print(df_8.head())

Raw Dataset:

        date\tclose\thigh\tlow\tSentiments\tEmployee
0  04/01/1999\t59.25\t59.5\t57.5\t\tkenneth.lay@e...
1  05/01/1999\t59\t60.13\t58.88\t\tkenneth.lay@en...
2  06/01/1999\t59.44\t59.75\t58.44\t\tkenneth.lay...
3  07/01/1999\t62\t62.88\t58.63\t\tkenneth.lay@en...
4  08/01/1999\t65.13\t65.38\t62.25\t\tkenneth.lay...

Cleaned Dataset:

            0      1      2      3 4                      5
0  04/01/1999  59.25   59.5   57.5    kenneth.lay@enron.com
1  05/01/1999     59  60.13  58.88    kenneth.lay@enron.com
2  06/01/1999  59.44  59.75  58.44    kenneth.lay@enron.com
3  07/01/1999     62  62.88  58.63    kenneth.lay@enron.com
4  08/01/1999  65.13  65.38  62.25    kenneth.lay@enron.com


### Seeing the Shape of the Dataset

In [51]:
print("Shape of the Dataset is:\n")

print(df_8.shape)

Shape of the Dataset is:

(6483, 6)


### Seeing the Columns of the Dataset

In [52]:
print("Columns of the Dataset are:\n")

print(df_8.columns)

Columns of the Dataset are:

RangeIndex(start=0, stop=6, step=1)


### Seeing the Information of the Dataset

In [53]:
print("Information of the Dataset is:\n")

print(df_8.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6483 entries, 0 to 6482
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       6483 non-null   object
 1   1       6483 non-null   object
 2   2       6483 non-null   object
 3   3       6483 non-null   object
 4   4       6483 non-null   object
 5   5       6483 non-null   object
dtypes: object(6)
memory usage: 304.0+ KB
None
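
After the manual split above, the column names are lost (`0` to `5`) and every column is stored as `object`. A possible alternative, sketched here under the assumption that the file is plainly tab-separated, is to pass `sep="\t"` to `pd.read_csv`, which keeps the header and lets pandas infer numeric types. The in-memory text below copies the header and first two rows from the raw output above as a stand-in for the file.

```python
import io
import pandas as pd

# Stand-in for the file contents (header and two rows from the raw output above)
raw = (
    "date\tclose\thigh\tlow\tSentiments\tEmployee\n"
    "04/01/1999\t59.25\t59.5\t57.5\t\tkenneth.lay@enron.com\n"
    "05/01/1999\t59\t60.13\t58.88\t\tkenneth.lay@enron.com\n"
)

# sep="\t" splits on tabs, so the header row becomes the column names
df = pd.read_csv(io.StringIO(raw), sep="\t")
print(df.columns.tolist())
print(df.dtypes)
```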


## Dataset 9: Enron Datasets temp2001.csv

### Loading and Printing the Dataset

In [54]:
df_9 = pd.read_csv(r"Dataset\Extra Dataset\Enron Datasets\temp2001.csv")

print("Dataset is:\n")

print(df_9.head())

Dataset is:

    emailDate  totalEmails  deviatedValue
0  2001-01-01           46          -3.78
1  2001-01-02          369           0.49
2  2001-01-03          391           0.15
3  2001-01-04          341           3.30
4  2001-01-05          340           2.92


### Seeing the Shape of the Dataset

In [55]:
print("Shape of the Dataset is:\n")

print(df_9.shape)

Shape of the Dataset is:

(365, 3)


### Seeing the Columns of the Dataset

In [56]:
print("Columns of the Dataset are:\n")

print(df_9.columns)

Columns of the Dataset are:

Index(['emailDate', 'totalEmails', 'deviatedValue'], dtype='object')


### Seeing the Information of the Dataset

In [57]:
print("Information of the Dataset is:\n")

print(df_9.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 365 entries, 0 to 364
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   emailDate      365 non-null    object 
 1   totalEmails    365 non-null    int64  
 2   deviatedValue  365 non-null    float64
dtypes: float64(1), int64(1), object(1)
memory usage: 8.7+ KB
None


## Hillary Clinton Datasets

## Dataset 10: Hillary Clinton Datasets Aliases.csv

### Loading and Printing the Dataset

In [58]:
df_10 = pd.read_csv(r"Dataset\Extra Dataset\Hillary Clinton Datasets\Aliases.csv")

print("Dataset is:\n")
print(df_10.head())

Dataset is:

   Id                         Alias  PersonId
0   1                111th congress         1
1   2  agna usemb kabul afghanistan         2
2   3                            ap         3
3   4                      asuncion         4
4   5                          alec         5


### Seeing the Shape of the Dataset

In [59]:
print("Shape of the Dataset is:\n")

print(df_10.shape)

Shape of the Dataset is:

(850, 3)


### Seeing the Columns of the Dataset

In [60]:
print("Columns of the Dataset are:\n")

print(df_10.columns)

Columns of the Dataset are:

Index(['Id', 'Alias', 'PersonId'], dtype='object')


### Seeing the Information of the Dataset

In [61]:
print("Information of the Dataset is:\n")

print(df_10.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Id        850 non-null    int64 
 1   Alias     850 non-null    object
 2   PersonId  850 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 20.1+ KB
None


## Dataset 11: Hillary Clinton Datasets EmailReceivers.csv

### Loading and Printing the Dataset

In [62]:
df_11 = pd.read_csv(r"Dataset\Extra Dataset\Hillary Clinton Datasets\EmailReceivers.csv")

print("Dataset is:\n")
print(df_11.head())

Dataset is:

   Id  EmailId  PersonId
0   1        1        80
1   2        2        80
2   3        3       228
3   4        3        80
4   5        4        80


### Seeing the Shape of the Dataset

In [63]:
print("Shape of the Dataset is:\n")

print(df_11.shape)

Shape of the Dataset is:

(9306, 3)


### Seeing the Columns of the Dataset

In [64]:
print("Columns of the Dataset are:\n")

print(df_11.columns)

Columns of the Dataset are:

Index(['Id', 'EmailId', 'PersonId'], dtype='object')


### Seeing the Information of the Dataset

In [65]:
print("Information of the Dataset is:\n")

print(df_11.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9306 entries, 0 to 9305
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Id        9306 non-null   int64
 1   EmailId   9306 non-null   int64
 2   PersonId  9306 non-null   int64
dtypes: int64(3)
memory usage: 218.2 KB
None


## Dataset 12: Hillary Clinton Datasets Emails.csv

### Loading and Printing the Dataset

In [66]:
df_12 = pd.read_csv(r"Dataset\Extra Dataset\Hillary Clinton Datasets\Emails.csv")

print("Dataset is:\n")
print(df_12.head())

Dataset is:

   Id  DocNumber                                    MetadataSubject  \
0   1  C05739545                                                WOW   
1   2  C05739546  H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...   
2   3  C05739547                                      CHRIS STEVENS   
3   4  C05739550                         CAIRO CONDEMNATION - FINAL   
4   5  C05739554  H: LATEST: HOW SYRIA IS AIDING QADDAFI AND MOR...   

     MetadataTo       MetadataFrom  SenderPersonId           MetadataDateSent  \
0             H  Sullivan, Jacob J            87.0  2012-09-12T04:00:00+00:00   
1             H                NaN             NaN  2011-03-03T05:00:00+00:00   
2            ;H    Mills, Cheryl D            32.0  2012-09-12T04:00:00+00:00   
3             H    Mills, Cheryl D            32.0  2012-09-12T04:00:00+00:00   
4  Abedin, Huma                  H            80.0  2011-03-11T05:00:00+00:00   

        MetadataDateReleased  \
0  2015-05-22T04:00:00+00:00   
1  2015-0

### Seeing the Shape of the Dataset

In [69]:
print("Shape of the Dataset is:\n")

print(df_12.shape)

Shape of the Dataset is:

(7945, 22)


### Seeing the Columns of the Dataset

In [68]:
print("Columns of the Dataset are:\n")

print(df_12.columns)

Columns of the Dataset are:

Index(['Id', 'DocNumber', 'MetadataSubject', 'MetadataTo', 'MetadataFrom',
       'SenderPersonId', 'MetadataDateSent', 'MetadataDateReleased',
       'MetadataPdfLink', 'MetadataCaseNumber', 'MetadataDocumentClass',
       'ExtractedSubject', 'ExtractedTo', 'ExtractedFrom', 'ExtractedCc',
       'ExtractedDateSent', 'ExtractedCaseNumber', 'ExtractedDocNumber',
       'ExtractedDateReleased', 'ExtractedReleaseInPartOrFull',
       'ExtractedBodyText', 'RawText'],
      dtype='object')


### Seeing the Information of the Dataset

In [67]:
print("Information of the Dataset is:\n")

print(df_12.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7945 entries, 0 to 7944
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7945 non-null   int64  
 1   DocNumber                     7945 non-null   object 
 2   MetadataSubject               7649 non-null   object 
 3   MetadataTo                    7690 non-null   object 
 4   MetadataFrom                  7788 non-null   object 
 5   SenderPersonId                7788 non-null   float64
 6   MetadataDateSent              7813 non-null   object 
 7   MetadataDateReleased          7945 non-null   object 
 8   MetadataPdfLink               7945 non-null   object 
 9   MetadataCaseNumber            7945 non-null   object 
 10  MetadataDocumentClass         7945 non-null   object 
 11  ExtractedSubject              6260 non-null   object 
 12  ExtractedTo                   

## Dataset 13: Hillary Clinton Datasets Persons.csv

### Loading and Printing the Dataset

In [70]:
df_13 = pd.read_csv(r"Dataset\Extra Dataset\Hillary Clinton Datasets\Persons.csv")

print("Dataset is:\n")
print(df_13.head())

Dataset is:

   Id                          Name
0   1                111th Congress
1   2  AGNA USEMB Kabul Afghanistan
2   3                            AP
3   4                      ASUNCION
4   5                          Alec


### Seeing the Shape of the Dataset

In [71]:
print("Shape of the Dataset is:\n")

print(df_13.shape)

Shape of the Dataset is:

(513, 2)


### Seeing the Columns of the Dataset

In [72]:
print("Columns of the Dataset are:\n")

print(df_13.columns)

Columns of the Dataset are:

Index(['Id', 'Name'], dtype='object')


### Seeing the Information of the Dataset

In [73]:
print("Information of the Dataset is:\n")

print(df_13.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 513 entries, 0 to 512
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Id      513 non-null    int64 
 1   Name    513 non-null    object
dtypes: int64(1), object(1)
memory usage: 8.1+ KB
None


## Email Description and Classification Dataset

## Dataset 14: Email Dataset.csv

### Loading and Printing the Dataset

In [75]:
df_14 = pd.read_csv(r"Dataset\Extra Dataset\email_dataset.csv")

print("Dataset is:\n")
print(df_14.head())

Dataset is:

         Message ID                                   From  \
0  17d4d46e0a0a4383             contestinvite@codechef.com   
1  17d4d11679830d85          messages-noreply@linkedin.com   
2  17d4ca8aba50de36  messaging-digest-noreply@linkedin.com   
3  17d4c8f5b6467c68                   no-reply@spotify.com   
4  17d4c882ac4bb740         jobalerts-noreply@linkedin.com   

                            From (name)                   To  \
0                     Meg from CodeChef  utaku0611@gmail.com   
1                              LinkedIn  utaku0611@gmail.com   
2  ARUMULLA YASWANTH REDDY via LinkedIn  utaku0611@gmail.com   
3                               Spotify  utaku0611@gmail.com   
4                   LinkedIn Job Alerts  utaku0611@gmail.com   

             To (Name)                                            Subject  \
0  utaku0611@gmail.com  Coders, you're invited to C.O.D.E.R.S (Div 2 &...   
1     Priyanjali Gupta                        341 people are noticing you  

### Seeing the Shape of the Dataset

In [76]:
print("Shape of the Dataset is:\n")

print(df_14.shape)

Shape of the Dataset is:

(293, 10)


### Seeing the Columns of the Dataset

In [77]:
print("Columns of the Dataset are:\n")

print(df_14.columns)

Columns of the Dataset are:

Index(['Message ID', 'From', 'From (name)', 'To', 'To (Name)', 'Subject',
       'Date Sent', 'Date Received', 'Email Text', 'spam'],
      dtype='object')


### Seeing the Information of the Dataset

In [78]:
print("Information of the Dataset is:\n")

print(df_14.info())

Information of the Dataset is:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293 entries, 0 to 292
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Message ID     293 non-null    object
 1   From           293 non-null    object
 2   From (name)    293 non-null    object
 3   To             293 non-null    object
 4   To (Name)      281 non-null    object
 5   Subject        292 non-null    object
 6   Date Sent      293 non-null    object
 7   Date Received  293 non-null    object
 8   Email Text     293 non-null    object
 9   spam           293 non-null    int64 
dtypes: int64(1), object(9)
memory usage: 23.0+ KB
None
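
Since `spam` is stored as an integer column, a natural next check is how the classes are balanced. The sketch below shows one way to do that with `value_counts`; the toy labels are invented for illustration and say nothing about the real distribution in `df_14`.

```python
import pandas as pd

# Made-up labels: 1 = spam, 0 = not spam (assumed encoding)
toy = pd.DataFrame({"spam": [0, 0, 0, 1]})

counts = toy["spam"].value_counts()                 # absolute counts per class
shares = toy["spam"].value_counts(normalize=True)   # fraction of rows per class
print(counts)
print(shares)
```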


### Final Dataset Selection: Email Description and Classification Dataset

After exploring the datasets above, we selected **Dataset 14: Email Description and Classification Dataset** for this project. It is the only dataset explored here that includes a spam label (the `spam` column) together with sender, recipient, subject, and body text, and it leaves room for additional feature extraction to improve the model's performance. Note that it contains only 293 rows.

### Email Spam Classification Project

This project focuses on building a machine learning model to classify spam emails using the selected dataset. The dataset includes sender, receiver, subject line, email content, and a spam label. The aim is to analyze these features and train a model to accurately identify spam.

### Dataset Details:
The **Email Description and Classification Dataset** provides key insights into:
- **Sender and Receiver Information**: Helps analyze patterns in communication.
- **Email Subject and Body**: Useful for extracting meaningful terms and spotting spam indicators.
- **Spam Labels**: Clearly marked to train and evaluate the model’s accuracy.

Additionally, this dataset allows for extracting advanced features like no-reply addresses, company-specific keywords, and response patterns to further improve the spam detection system.
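
As a rough illustration of the sender-based features mentioned above, the sketch below flags no-reply addresses and extracts the sender domain from a `From` column. The helper name `add_sender_features`, the regular expression, and the output column names are assumptions made for this example; the three addresses are taken from the first rows of Dataset 14 shown earlier.

```python
import pandas as pd

def add_sender_features(df, from_col="From"):
    """Add simple sender-derived columns (hypothetical helper, not the project's final features)."""
    out = df.copy()
    # Matches "noreply", "no-reply", "no_reply" or "no.reply", case-insensitively
    out["is_no_reply"] = out[from_col].str.contains(
        r"no[-_.]?reply", case=False, regex=True, na=False
    )
    # Everything after the last "@", lower-cased
    out["sender_domain"] = out[from_col].str.split("@").str[-1].str.lower()
    return out

sample = pd.DataFrame({
    "From": [
        "contestinvite@codechef.com",
        "messages-noreply@linkedin.com",
        "no-reply@spotify.com",
    ]
})

feats = add_sender_features(sample)
print(feats[["From", "is_no_reply", "sender_domain"]])
```

Whether these flags actually help separate spam from non-spam would need to be checked against the `spam` labels.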