## Email Crawler

To get started, the only third-party package we need is `pandas` (used at the end to build the DataFrames). All the other modules used in this tutorial are part of Python's standard library:

In [1]:
import imaplib
import email
from email.header import decode_header
import webbrowser
import os
import re # I need this package to filter particular text-patterns in my emails. --> Note: it is already built-in!
import pandas as pd # to create the Dataframes at the end

# account credentials
username = "my.email@example.com" # <-- Place your email-account that you want to scrape in here. I deleted my credentials, since you should NEVER push them to Github!
password = "Your_Email_Password" # <-- Paste in your email-password here. I deleted my credentials, since you should NEVER push them to Github!
# use your email provider's IMAP server, you can look for your provider's IMAP server on Google
# or check this page: https://www.systoolsgroup.com/imap/
# for hotmail, it's this:
imap_server = "imap-mail.outlook.com" # checkout this link for other email-servers: https://www.systoolsgroup.com/imap/

def clean(text):
    # clean text for creating a folder
    return "".join(c if c.isalnum() else "_" for c in text)

- <u>Link to other Email Servers</u>: https://www.systoolsgroup.com/imap/
- <u>Note</u>: We will need the `clean()` function later to create folder names without spaces or special characters (it replaces every non-alphanumeric character with an underscore).
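
Since the credentials should never end up in a repository, one option is to read them from environment variables instead of hardcoding them. The following is a minimal sketch; the variable names `EMAIL_USER` and `EMAIL_PASSWORD` and the helper `load_credentials` are hypothetical and not part of the original notebook.

```python
import os


def load_credentials(env=None):
    """Return (username, password) read from environment variables.

    EMAIL_USER and EMAIL_PASSWORD are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    username = env.get("EMAIL_USER")
    password = env.get("EMAIL_PASSWORD")
    if not username or not password:
        raise RuntimeError("Set EMAIL_USER and EMAIL_PASSWORD before running the crawler.")
    return username, password
```

With this in place, `username, password = load_credentials()` could replace the two hardcoded assignments.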

### 1) Establish a connection to the Email-Server

In [2]:
# create an IMAP4 class with SSL 
imap = imaplib.IMAP4_SSL(imap_server)
# authenticate
imap.login(username, password)

('OK', [b'LOGIN completed.'])

### 2) Get <u>all</u> the Emails from the "Jobs"-Folder

In [3]:
status, messages = imap.select("Jobs")

# number of emails to fetch (from the "top" - i.e. the most recent - to the "bottom"...)
N = 234 # <-- Key: Modify THIS bit to match the number of emails you want to read (at most the total in the folder)!

# after the conversion below, "messages" will contain the total number of emails in the folder "Jobs"
messages = int(messages[0])
print(messages) # check, if it worked? --> we see that we currently have 234 emails in the folder "Jobs"

234


- <u>Exploration of your Mail-Box's Folders</u>: You can use the `imap.list()` method to see the available mailboxes.

In [4]:
imap.list()

('OK',
 [b'(\\HasNoChildren) "/" Archiv',
  b'(\\HasNoChildren \\Drafts) "/" Drafts',
  b'(\\HasChildren \\Trash) "/" Deleted',
  b'(\\HasNoChildren \\Sent) "/" Sent',
  b'(\\HasChildren) "/" Jobs',
  b'(\\HasNoChildren \\Junk) "/" Junk',
  b'(\\HasNoChildren) "/" Notes',
  b'(\\HasNoChildren) "/" Outbox',
  b'(\\Marked \\HasNoChildren) "/" Inbox',
  b'(\\HasChildren) "/" Synchronisierungsprobleme',
  b'(\\HasNoChildren) "/" Synchronisierungsprobleme/Konflikte',
  b'(\\HasNoChildren) "/" "Synchronisierungsprobleme/Lokale Fehler"',
  b'(\\HasNoChildren) "/" Synchronisierungsprobleme/Serverfehler'])
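
Each entry returned by `imap.list()` packs the flags, the hierarchy delimiter and the mailbox name into one bytes string. Here is a small sketch of how the names could be pulled out. The helper `parse_mailbox_name` and its regular expression are illustrative assumptions, and decoding as UTF-8 is a simplification (IMAP encodes non-ASCII mailbox names in a modified UTF-7).

```python
import re

# Pattern for lines such as: (\HasChildren) "/" Jobs
LIST_LINE = re.compile(rb'\((?P<flags>[^)]*)\) "(?P<delim>[^"]*)" (?P<name>.*)')


def parse_mailbox_name(line):
    """Extract the mailbox name from one entry of an IMAP LIST response."""
    match = LIST_LINE.match(line)
    if match is None:
        raise ValueError(f"Unexpected LIST line: {line!r}")
    # Names containing spaces arrive wrapped in double quotes; strip them.
    return match.group("name").decode("utf-8").strip('"')


sample = [
    b'(\\HasChildren) "/" Jobs',
    b'(\\HasNoChildren) "/" "Synchronisierungsprobleme/Lokale Fehler"',
]
names = [parse_mailbox_name(line) for line in sample]
```

The resulting names are the strings you would pass to `imap.select()`.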

### 3) Read some emails from our "Jobs"-Folder

In the next 2 cells, **I will construct a time series dataset**.

<u>This dataset will</u>:

- Capture the **dynamics (over time) of entry-level jobs in the Swiss job market for graduate economics & data science students**.
- Track the number of _new_ open positions each day in the Swiss job market.

_Note that the data I collected was generated by an algorithm that is provided by one of the biggest digital Swiss job-portals_.

In [5]:
dict_for_df = dict() # this empty dictionary will be transformed into the final time series ds later...
dates_array = [] # this will be my column "Date" in my later ds
total_jobs_available_today_array = [] # will be my column "Total Jobs available Today" 
new_jobs_today_array = [] # will be my column "New Jobs published Today"

### For 2nd DF:
dict_for_df2 = dict()
urls_array = [] # will be a list of lists (with an index of 5 --> this is KEY, because we need it to have the same length as the dates, since the dates will be our "observations"-unit in the DF2!)
job_titles_array = [] # same as "urls_array"...
companies_array = [] # same as "urls_array"...


In [6]:
for i in range(messages, messages-N, -1): # Why do we do this weird "backward"-loop? --> we want to iterate from the top (newest) to the bottom (oldest)
    # Key: in order to go "forward in time" (starting in the distant past), we need to iterate by typing in the 
    # following parameters: `messages-N+1, messages+1` and leave out "-1"(!!) at the end as the 3rd // last argument...
    # Backwards would be: `range(messages, messages-N, -1)`
    # fetch the email message by ID
    print("Currently reading {0}-th Email...".format(i))
    res, msg = imap.fetch(str(i), "(RFC822)") # `RFC822` is a special format that we can use to fetch the emails from the server: https://www.rfc-editor.org/rfc/rfc822
    for response in msg:
        if isinstance(response, tuple):
            msg = email.message_from_bytes(response[1]) # parse the bytes returned by the `fetch()`-method to a proper "Message"-object
            subject, encoding = decode_header(msg["Subject"])[0] # decode the "Subject" header of the email to human-readable Unicode.
            if isinstance(subject, bytes): # if the "subject" is of type "bytes", decode to str
                subject = subject.decode(encoding or "utf-8") # `encoding` can be None, so fall back to UTF-8
            From, encoding = decode_header(msg.get("From"))[0] # decode the sender (= "From") header of the email to human-readable Unicode.
            if isinstance(From, bytes): # if the sender (= "From") is of type "bytes", decode to str
                From = From.decode(encoding or "utf-8") # same fallback as for the subject
            print("Subject:", subject)
            print("From:", From)
            print(msg['Date'])
            todays_email_date = msg['Date']
            dates_array.append(todays_email_date) # append the dates to my empty array
            # if the email message is "multipart":  for instance, an email message can contain the "text/html"-content AND "text/plain"-parts, e.g. it has the HTML and(!) plain text versions of the message.
            if msg.is_multipart():
                # iterate over email parts
                for part in msg.walk():
                    # extract content type of email
                    content_type = part.get_content_type()
                    content_disposition = str(part.get("Content-Disposition"))
                    try:
                        # get the email body
                        body = part.get_payload(decode=True).decode()
                        lines = body.split('\n')
                        job_titles = []
                        companies = []
                        for idx, line in enumerate(lines): # `enumerate` gives the position directly (`lines.index(line)` would return the first match if a line repeats)
                            if "https://www.jobs.ch/de/stellenangebote/detail/" in line:
                                job_titles.append(lines[idx - 2])
                                companies.append(lines[idx - 1])
                        matches_plural = re.findall(r".*neue Jobs.*", body)
                        matches_singular = re.findall(r".*neuer Job.*", body)
                        matches = matches_singular + matches_plural # concatenate the 2 lists into 1 (bigger) list
                        new_jobs_per_day = list(map(lambda string: int(string[0:2]), matches)) # take only the first two characters of each string (= the "number" - currently given as a `string` - that we need to calculate the total number of new job positions that opened "today"). Note: this assumes counts below 100
                        total_new_jobs_today = sum(new_jobs_per_day)
                        urls = re.findall(r"https://www.jobs.ch/de/stellenangebote/detail/\S+", body)
                        currently_open_job_positions = len(urls)
                        total_jobs_available_today_array.append(currently_open_job_positions) 
                        new_jobs_today_array.append(total_new_jobs_today)
                        #print(matches)
                        #print(new_jobs_per_day)
                        #print(currently_open_job_positions)
                        #print(total_new_jobs_today)
                        #print(urls)
                        urls_array.append(urls)
                        #print(len(urls))
                        #print(job_titles)
                        job_titles_array.append(job_titles)
                        #print(len(job_titles))
                        #print(companies)
                        companies_array.append(companies)
                        #print(len(companies))
                        #print(body)
                        #print(matches_singular)
                        #print(matches_plural)
                    except Exception: # e.g. multipart container parts have no payload to decode
                        pass
                    #if content_type == "text/plain" and "attachment" not in content_disposition:
                        # print text/plain emails and skip attachments
                        #print(body)
                    if "attachment" in content_disposition: # previously: `elif` (instead of `if`!)
                        # download attachment
                        filename = part.get_filename()
                        if filename:
                            folder_name = clean(subject)
                            if not os.path.isdir(folder_name):
                                # make a folder for this email (named after the subject)
                                os.mkdir(folder_name)
                            filepath = os.path.join(folder_name, filename)
                            # download attachment and save it
                            with open(filepath, "wb") as f: # `with` closes the file again after writing
                                f.write(part.get_payload(decode=True))
            else:
                # extract content type of email
                content_type = msg.get_content_type()
                # get the email body
                body = msg.get_payload(decode=True).decode()
                #if content_type == "text/plain":
                    # print only text email parts
                    #print(body)
            #if content_type == "text/html":
                # if it's HTML, create a new HTML file and open it in browser
                #folder_name = clean(subject)
                #if not os.path.isdir(folder_name):
                    # make a folder for this email (named after the subject)
                    #os.mkdir(folder_name)
                #filename = "index.html"
                #filepath = os.path.join(folder_name, filename)
                # write the file
                #open(filepath, "w").write(body)
                # open in the default browser
                #webbrowser.open(filepath)
            print("="*100)
dict_for_df['Date'] = dates_array
dict_for_df['Total of open Job-Positions up until today'] = total_jobs_available_today_array
dict_for_df['New Jobs published Today'] = new_jobs_today_array

### 2nd DF:
for (date, url, title, company) in zip(dates_array, urls_array,  job_titles_array, companies_array):
    dict_for_df2[date] = [url, title, company] # this will add new keys to the dictionary, where the 'dates' are the keys and a list (of sublists, containing the URLs, Job-Titles and Company-Names) as the values...


Currently reading 234-th Email...
Subject: 59 neue Stellenangebote gefunden
From: "jobs.ch Job-Alarm" <jobmail@jobs.ch>
Wed, 4 Jan 2023 06:16:14 +0000
Currently reading 233-th Email...
Subject: 8 neue Stellenangebote gefunden
From: "jobs.ch Job-Alarm" <jobmail@jobs.ch>
Tue, 3 Jan 2023 05:51:23 +0000
Currently reading 232-th Email...
Subject: 17 neue Stellenangebote gefunden
From: "jobs.ch Job-Alarm" <jobmail@jobs.ch>
Mon, 2 Jan 2023 05:50:05 +0000
Currently reading 231-th Email...
Subject: 40 neue Stellenangebote gefunden
From: "jobs.ch Job-Alarm" <jobmail@jobs.ch>
Sat, 31 Dec 2022 05:57:44 +0000
Currently reading 230-th Email...
Subject: 34 neue Stellenangebote gefunden
From: "jobs.ch Job-Alarm" <jobmail@jobs.ch>
Wed, 28 Dec 2022 06:10:27 +0000
Currently reading 229-th Email...
Subject: 36 neue Stellenangebote gefunden
From: "jobs.ch Job-Alarm" <jobmail@jobs.ch>
Thu, 29 Dec 2022 05:55:44 +0000
Currently reading 228-th Email...
Subject: 24 neue Stellenangebote gefunden
From: "jobs.ch J

- <u>Note</u>: If you want to read the oldest emails first, change the loop to something like `range(1, N + 1)` (IMAP message numbers start at 1, so `range(N)` would begin with an invalid ID of 0).
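
To make the two loop directions concrete, here is a small sketch that only computes the message numbers, without contacting the server. The helper `message_ids` is hypothetical, and it assumes that the lowest numbers belong to the oldest messages in the folder.

```python
def message_ids(total, n, newest_first=True):
    """Return the IMAP message numbers of n messages out of total."""
    n = min(n, total)  # never ask for more messages than the folder holds
    if newest_first:
        return list(range(total, total - n, -1))
    return list(range(1, n + 1))


newest = message_ids(234, 3)                      # counts down from the last message
oldest = message_ids(234, 3, newest_first=False)  # counts up from message 1
```

Clamping `n` to `total` also keeps the loop from requesting message numbers below 1 when `N` is larger than the folder.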

<u>Next, I will create a **second dataframe**, whose purpose will be</u>:

- To collect the **URLs** that redirect to the currently open job-application, 
- As well as the corresponding **job-title** & **company name**,
- And the job's **publishing-date**.

In [11]:
print(len(urls_array))
print(len(job_titles_array))
print(len(companies_array))

234
234
234


In [None]:
msg.keys() # see available email-fields from which you can extract data. 
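
Any of these header fields may be MIME-encoded and may consist of several encoded chunks, while the loop above only decodes the first chunk. A more defensive variant could look like the sketch below; the helper name `decode_mime_header` is an assumption made for this example.

```python
from email.header import decode_header


def decode_mime_header(raw_value):
    """Decode a possibly MIME-encoded header value into a plain str."""
    if raw_value is None:
        return ""
    pieces = []
    for text, charset in decode_header(raw_value):
        if isinstance(text, bytes):
            # charset is None for unencoded byte chunks, so fall back to UTF-8
            text = text.decode(charset or "utf-8", errors="replace")
        pieces.append(text)
    return "".join(pieces)


plain = decode_mime_header("59 neue Stellenangebote gefunden")
encoded = decode_mime_header("=?utf-8?q?Z=C3=BCrich?=")
```

Inside the loop, `subject = decode_mime_header(msg["Subject"])` would then replace the manual decoding steps.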

### 4) Create your DataFrames

First, let's create the **time series dataset**, _where the "Date"-column serves as a **unique ID**_.

In [None]:
df = pd.DataFrame(dict_for_df)
df

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df.sort_values(by="Date", inplace=True) # reverse the order of the DF (currently the "newest" date is in the front --> bring it into the back)

In [None]:
df
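
Sorting matters here because the order of the emails in the folder is not strictly chronological: in the output above, email 229 is dated 29 Dec while email 230 is dated 28 Dec. As a standard-library illustration of what the date conversion does, the sketch below parses three of the `Date` headers printed earlier and sorts them (`pd.to_datetime` plays this role in the notebook).

```python
from email.utils import parsedate_to_datetime

# Three "Date" headers copied from the crawler output, in mailbox order
raw_dates = [
    "Sat, 31 Dec 2022 05:57:44 +0000",
    "Wed, 28 Dec 2022 06:10:27 +0000",
    "Thu, 29 Dec 2022 05:55:44 +0000",
]

# Parse the RFC 2822 strings into timezone-aware datetimes and sort them
parsed = sorted(parsedate_to_datetime(d) for d in raw_dates)
iso = [d.isoformat() for d in parsed]
```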

#### Create the 2nd DataFrame

In [7]:
# Create the dataframe
rows = []
for key, values in dict_for_df2.items():
    for value in values:
        row = {key: value}
        rows.append(row)

list_of_df = []
for i in rows:
    df2 = pd.DataFrame(i)
    list_of_df.append(df2)

# Problem in all the DFs? --> we have the Date as 1 column, instead of rows... --> Solution: transform the DF into "long-format"...
test = list_of_df[1]
test.stack().droplevel(level=0).to_frame().rename({0: 'Job Title'}, axis='columns') # Key: 'droplevel'-method, because I had an unnecessary multi-index...  

url_series = pd.Series(dtype="object") # an explicit dtype avoids the empty-Series warning
#all_series = pd.Series()
# Transform all the DFs for the column 'URL':
for i in list_of_df[0::3]: # Key: we start with the 1st list-element of 'list_of_df' and then only iterate over every 3rd element
    url_series = pd.concat([url_series, i.stack().droplevel(level=0)])

title_series = pd.Series(dtype="object")
# Transform all the DFs for the column 'Job Title':
for i in list_of_df[1::3]: # Key: we start with the 2nd list-element of 'list_of_df' and then only iterate over every 3rd element
    title_series = pd.concat([title_series, i.stack().droplevel(level=0)])

company_series = pd.Series(dtype="object")
# Transform all the DFs for the column 'Employer Company':
for i in list_of_df[2::3]: # Key: we start with the 3rd list-element of 'list_of_df' and then only iterate over every 3rd element
    company_series = pd.concat([company_series, i.stack().droplevel(level=0)])
    
#for i in list_of_df[0::3]: # Key: we start with the 1st list-element of 'list_of_df' and then only iterate over every 3rd elements
#    all_series = pd.concat([all_series, i.stack().droplevel(level=0)])
#    for j in list_of_df[1::3]: # Key: we start with the 1st list-element of 'list_of_df' and then only iterate over every 3rd elements
#        all_series = pd.concat([all_series, j.stack().droplevel(level=0)])
#        for k in list_of_df[2::3]: # Key: we start with the 1st list-element of 'list_of_df' and then only iterate over every 3rd elements
#            all_series = pd.concat([all_series, k.stack().droplevel(level=0)])

url_df = url_series.to_frame().rename({0: 'url'}, axis='columns').reset_index().rename({'index': 'Date'}, axis='columns')
title_df = title_series.to_frame().rename({0: 'Job Title'}, axis='columns').reset_index().rename({'index': 'Date'}, axis='columns')
company_df = company_series.to_frame().rename({0: 'Employer Company'}, axis='columns').reset_index().rename({'index': 'Date'}, axis='columns')

url_df['Date'] = pd.to_datetime(url_df['Date'])
title_df['Date'] = pd.to_datetime(title_df['Date'])
company_df['Date'] = pd.to_datetime(company_df['Date'])




In [12]:
len(list_of_df) # check: for each crawled email, I need data on 3 columns (URLs, job titles, companies)
# Hence I should have "3 (col) * number of emails (= dates)" sub-lists of data in total, e.g. 3 * 5 = 15 when crawling 5 emails

NameError: name 'list_of_df' is not defined

In [8]:
# Merging...
noob = pd.merge(
    url_df,
    title_df,
    how="left",
    on=url_df.index, # Key: we merge via the Index, which are only numbers from 0-184 --> merging via index. Note that I needed to put the "Date" as a column (and NOT as an index!)
).drop(['key_0', 'Date_y'], axis = 1) # we only need one of the "Date"-Columns. Also, we can drop the automatically generated column "key_0", which is a leftover-column that gets generated, because we merged via the "index". 

# second merge, since we need to combine 3 DFs into 1 "big" DF --> we currently only merged 2 (out of 3) DFs, that's why we need this 2nd merging...
noob = pd.merge(
    noob,
    company_df,
    how="left",
    on=noob.index, # Key: Again, we merge via the Index, which are ALL our observations...
).drop(['key_0', 'Date_x'], axis = 1)#.set_index('Date') # same as first merging: we drop one of the "Date"-Columns, as well as the automatically generated "index"-column (= 'key_0')... 

df2 = noob.sort_values(by="Date", ascending = False)
df2 = df2.reindex(columns=['Date', 'url', 'Job Title', 'Employer Company']) # change the order of columns: i want "date"-col to be the first one...

df2['Job Title'] = [el.replace("\r", "") for el in df2['Job Title']] # we need to remove the "\r" behind all the elements of the col of "Job Title" & "Employer Company", otherwise saving the ds correctly will not be possible! xD
df2['Employer Company'] = [el.replace("\r", "") for el in df2['Employer Company']] # same as above

In [18]:
df2 = df2.reset_index().drop(['index'], axis = 1)
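
The trailing `\r` characters most likely come from Windows-style line endings (`\r\n`) in the email body, because `body.split('\n')` only removes the `\n`. The sketch below shows the difference on a made-up two-line body assembled from values in the output; `str.splitlines()` is one possible way to avoid the later clean-up.

```python
# Made-up sample body with CRLF line endings
body = "Data Scientist\r\nMigros-Genossenschafts-Bund, Zürich\r\n"

split_on_newline = body.split("\n")  # keeps the "\r" and adds a trailing empty string
split_lines = body.splitlines()      # treats "\r\n" as a single line boundary
```

Note that `splitlines()` drops the trailing empty string, so any logic that relies on relative line positions (such as "two lines above the URL") stays the same.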

### 5) Save the generated DataFrame

In [None]:
df.to_csv('../data/jobs-over-time.csv', index=False)

In [36]:
df2.to_csv('../data/job-urls.csv', index=False)
df2.to_json('../data/job-urls.json')

In [35]:
df2

| | Date | url | Job Title | Employer Company |
|---|---|---|---|---|
| 0 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Data Scientist - Optimierung Filialperformance... | Migros-Genossenschafts-Bund, Zürich |
| 1 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Praktikant/in Backoffice 100% [Ref:1785] | Freestar-Informatik AG, Zürich |
| 2 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Salesforce Consulting Trainee at a leading Men... | Sparrow Ventures, Zürich |
| 3 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Praktikum im Bereich Manufacturing Engineering | Biotronik AG, Bülach |
| 4 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Internship in Sterilization and Contamination ... | Biotronik AG, Bülach |
| ... | ... | ... | ... | ... |
| 9718 | 2021-11-01 07:04:45+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Praktikant:in Social Media und Community SRF News | Schweizer Radio und Fernsehen, Zürich |
| 9719 | 2021-11-01 07:04:45+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Initial Training Academy (ATCO) Instructor: AC... | skyguide, Dübendorf |
| 9720 | 2021-11-01 07:04:45+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Praktikant:in Social Media und Community SRF News | Schweizer Radio und Fernsehen, Zürich |
| 9721 | 2021-11-01 07:04:45+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Aktuar oder Ökonom in der Versicherungsaufsich... | Eidgenössische Finanzmarktaufsicht FINMA, Bern |


In [34]:
test = pd.read_csv(
    "/Users/jomaye/Documents/Programming/04-DS-repos/Python/email-scraper/job-urls.csv"
) 
test # check, if it worked?

| | Date | url | Job Title | Employer Company |
|---|---|---|---|---|
| 0 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Data Scientist - Optimierung Filialperformance... | Migros-Genossenschafts-Bund, Zürich |
| 1 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Praktikant/in Backoffice 100% [Ref:1785] | Freestar-Informatik AG, Zürich |
| 2 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Salesforce Consulting Trainee at a leading Men... | Sparrow Ventures, Zürich |
| 3 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Praktikum im Bereich Manufacturing Engineering | Biotronik AG, Bülach |
| 4 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Internship in Sterilization and Contamination ... | Biotronik AG, Bülach |
| ... | ... | ... | ... | ... |
| 9718 | 2021-11-01 07:04:45+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Praktikant:in Social Media und Community SRF News | Schweizer Radio und Fernsehen, Zürich |
| 9719 | 2021-11-01 07:04:45+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Initial Training Academy (ATCO) Instructor: AC... | skyguide, Dübendorf |
| 9720 | 2021-11-01 07:04:45+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Praktikant:in Social Media und Community SRF News | Schweizer Radio und Fernsehen, Zürich |
| 9721 | 2021-11-01 07:04:45+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... | Aktuar oder Ökonom in der Versicherungsaufsich... | Eidgenössische Finanzmarktaufsicht FINMA, Bern |


### 6) Close the Connection to the Email-Server

In [13]:
# close the connection and logout
imap.close()
imap.logout()

('BYE', [b'Microsoft Exchange Server IMAP4 server signing off.'])

### 7) Next Steps?

- **Update your dataset**: Write an `update-data.py` script that appends the newest data to your dataset every day.
- **Create a UI**: Download the `.csv` file with the newest data and populate an `.html` file with `<a>` tags that contain all these URLs.
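
As a rough idea of what the UI step could look like, the sketch below turns CSV rows into an HTML list of links using only the standard library. The function `rows_to_html` and the toy CSV content (including its URL) are hypothetical; the column names match those of `df2`.

```python
import csv
import html
import io


def rows_to_html(rows):
    """Render rows with 'url', 'Job Title' and 'Employer Company' keys as an HTML list of links."""
    items = []
    for row in rows:
        href = html.escape(row["url"], quote=True)  # escape "&" and quotes inside the link
        label = html.escape(f'{row["Job Title"]} ({row["Employer Company"]})')
        items.append(f'<li><a href="{href}">{label}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


# Toy CSV standing in for job-urls.csv (hypothetical URL and values)
sample_csv = io.StringIO(
    "Date,url,Job Title,Employer Company\n"
    '2023-01-04,https://example.com/job?id=1&lang=de,Data Scientist,"Example AG, Zürich"\n'
)
page = rows_to_html(csv.DictReader(sample_csv))
```

Writing `page` into an `.html` file would give a minimal clickable overview of the collected jobs.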

### 💭 Important Things to know about generating those Datasets! 🧐

#### <u>The Outputs from Crawling my Emails</u>:

In [6]:
for i in range(messages, messages-N, -1): # Why do we do this weird "backward"-loop? --> we want to iterate from the top (newest) to the bottom (oldest)
    # Key: in order to go "forward in time" (starting in the distant past), we need to iterate by typing in the 
    # following parameters: `messages-N+1, messages+1` and leave out "-1"(!!) at the end as the 3rd // last argument...
    # Backwards would be: `range(messages, messages-N, -1)`
    # fetch the email message by ID
    res, msg = imap.fetch(str(i), "(RFC822)") # `RFC822` is a special format that we can use to fetch the emails from the server: https://www.rfc-editor.org/rfc/rfc822
    for response in msg:
        if isinstance(response, tuple):
            msg = email.message_from_bytes(response[1]) # parse the bytes returned by the `fetch()`-method to a proper "Message"-object
            subject, encoding = decode_header(msg["Subject"])[0] # decode the "Subject" header of the email to human-readable Unicode.
            if isinstance(subject, bytes): # if the "subject" is of type "bytes", decode to str
                subject = subject.decode(encoding or "utf-8") # `encoding` can be None, so fall back to UTF-8
            From, encoding = decode_header(msg.get("From"))[0] # decode the sender (= "From") header of the email to human-readable Unicode.
            if isinstance(From, bytes): # if the sender (= "From") is of type "bytes", decode to str
                From = From.decode(encoding or "utf-8") # same fallback as for the subject
            print("Subject:", subject)
            print("From:", From)
            print(msg['Date'])
            todays_email_date = msg['Date']
            dates_array.append(todays_email_date) # append the dates to my empty array
            # if the email message is "multipart":  for instance, an email message can contain the "text/html"-content AND "text/plain"-parts, e.g. it has the HTML and(!) plain text versions of the message.
            if msg.is_multipart():
                # iterate over email parts
                for part in msg.walk():
                    # extract content type of email
                    content_type = part.get_content_type()
                    content_disposition = str(part.get("Content-Disposition"))
                    try:
                        # get the email body
                        body = part.get_payload(decode=True).decode()
                        lines = body.split('\n')
                        job_titles = []
                        companies = []
                        for idx, line in enumerate(lines): # `enumerate` gives the position directly (`lines.index(line)` would return the first match if a line repeats)
                            if "https://www.jobs.ch/de/stellenangebote/detail/" in line:
                                job_titles.append(lines[idx - 2])
                                companies.append(lines[idx - 1])
                        matches_plural = re.findall(r".*neue Jobs.*", body)
                        matches_singular = re.findall(r".*neuer Job.*", body)
                        matches = matches_singular + matches_plural # concatenate the 2 lists into 1 (bigger) list
                        new_jobs_per_day = list(map(lambda string: int(string[0:2]), matches)) # take only the first two characters of each string (= the "number" - currently given as a `string` - that we need to calculate the total number of new job positions that opened "today"). Note: this assumes counts below 100
                        total_new_jobs_today = sum(new_jobs_per_day)
                        urls = re.findall(r"https://www.jobs.ch/de/stellenangebote/detail/\S+", body)
                        currently_open_job_positions = len(urls)
                        total_jobs_available_today_array.append(currently_open_job_positions) 
                        new_jobs_today_array.append(total_new_jobs_today)
                        print(matches)
                        print(new_jobs_per_day)
                        print(currently_open_job_positions)
                        print(total_new_jobs_today)
                        print(urls)
                        urls_array.append(urls)
                        print(len(urls))
                        print(job_titles)
                        job_titles_array.append(job_titles)
                        print(len(job_titles))
                        print(companies)
                        companies_array.append(companies)
                        print(len(companies))
                        #print(body)
                        #print(matches_singular)
                        #print(matches_plural)
                    except Exception: # e.g. multipart container parts have no payload to decode
                        pass
                    #if content_type == "text/plain" and "attachment" not in content_disposition:
                        # print text/plain emails and skip attachments
                        #print(body)
                    if "attachment" in content_disposition: # previously: `elif` (instead of `if`!)
                        # download attachment
                        filename = part.get_filename()
                        if filename:
                            folder_name = clean(subject)
                            if not os.path.isdir(folder_name):
                                # make a folder for this email (named after the subject)
                                os.mkdir(folder_name)
                            filepath = os.path.join(folder_name, filename)
                            # download attachment and save it
                            with open(filepath, "wb") as f: # `with` closes the file again after writing
                                f.write(part.get_payload(decode=True))
            else:
                # extract content type of email
                content_type = msg.get_content_type()
                # get the email body
                body = msg.get_payload(decode=True).decode()
                #if content_type == "text/plain":
                    # print only text email parts
                    #print(body)
            #if content_type == "text/html":
                # if it's HTML, create a new HTML file and open it in browser
                #folder_name = clean(subject)
                #if not os.path.isdir(folder_name):
                    # make a folder for this email (named after the subject)
                    #os.mkdir(folder_name)
                #filename = "index.html"
                #filepath = os.path.join(folder_name, filename)
                # write the file
                #open(filepath, "w").write(body)
                # open in the default browser
                #webbrowser.open(filepath)
            print("="*100)
dict_for_df['Date'] = dates_array
dict_for_df['Total of open Job-Positions up until today'] = total_jobs_available_today_array
dict_for_df['New Jobs published Today'] = new_jobs_today_array

### 2nd DF:
for (date, url, title, company) in zip(dates_array, urls_array,  job_titles_array, companies_array):
    dict_for_df2[date] = [url, title, company] # this will add new keys to the dictionary, where the 'dates' are the keys and a list (of sublists, containing the URLs, Job-Titles and Company-Names) as the values...


Subject: 59 neue Stellenangebote gefunden
From: "jobs.ch Job-Alarm" <jobmail@jobs.ch>
Wed, 4 Jan 2023 06:16:14 +0000
['7 neue Jobs für Data Scientist Jobs\r', '3 neue Jobs für Web Developer Jobs\r', '16 neue Jobs für Studentische Hilfskraft Jobs\r', '16 neue Jobs für Studentischer Mitarbeiter Jobs\r', '17 neue Jobs für Student Jobs\r']
[7, 3, 16, 16, 17]
51
59
['https://www.jobs.ch/de/stellenangebote/detail/213d4319-f0a4-42ec-836b-5dc9417a48d4/?hash=5c9995cb43cf3073dcaf0fd3068564c9&profile-id=3e604a99-5577-4a68-b147-8522488c1e04&reference-date=2023-01-02&source=job_alert_email_direct&pid=1&utm_source=automail-job-push&utm_medium=email&utm_campaign=wb%3Ajobs%7Ctg%3Ab2c%7Ccn%3Aww%7Clg%3Ade%7Cmg%3Ajob-views%7Cpd%3An%7Ccd%3Ajob-alert&utm_content=job-link&uid=6c67e29e-faf0-4238-ad1d-0e2a6c78868c&mid=58a95130-2750-462f-9375-b6edc0abfac2', 'https://www.jobs.ch/de/stellenangebote/detail/9f6dad5c-bcaa-498f-b3c9-78e9fe2afdc6/?hash=5c9995cb43cf3073dcaf0fd3068564c9&profile-id=3e604a99-5577-4a68-b1
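
The output above shows how the counts are read from the first characters of each matching line. A regular expression with a capture group could do the same without the two-character limit. The sketch below is an alternative under that assumption, not the notebook's method; `COUNT_PATTERN` and `count_new_jobs` are hypothetical names, and the sample body is rebuilt from the matches printed above.

```python
import re

# Captures the leading number of lines like "7 neue Jobs ..." or "1 neuer Job ..."
COUNT_PATTERN = re.compile(r"^\s*(\d+) neuer? Jobs?\b", flags=re.MULTILINE)


def count_new_jobs(body):
    """Sum the leading numbers of all 'N neue Jobs' / '1 neuer Job' lines."""
    return sum(int(number) for number in COUNT_PATTERN.findall(body))


sample_body = (
    "7 neue Jobs für Data Scientist Jobs\r\n"
    "3 neue Jobs für Web Developer Jobs\r\n"
    "16 neue Jobs für Studentische Hilfskraft Jobs\r\n"
    "16 neue Jobs für Studentischer Mitarbeiter Jobs\r\n"
    "17 neue Jobs für Student Jobs\r\n"
)
total = count_new_jobs(sample_body)
```

For this sample the total agrees with the 59 printed above, and a line such as "120 neue Jobs ..." would be counted as 120 rather than 12.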

#### The "Special Dictionary" for the second DataFrame

To generate the **2nd Dataset**, I matched each _column_ (= the <u>values</u> of the `dict`) I wanted with the respective _date_ (= the <u>keys</u> of the `dict`) the Job was published.

**Note that the dictionary's <u>values</u> are _nested lists_, where each _sublist reflects a column_ in my future DF**.

In [8]:
dict_for_df2

{'Wed, 4 Jan 2023 06:16:14 +0000': [['https://www.jobs.ch/de/stellenangebote/detail/213d4319-f0a4-42ec-836b-5dc9417a48d4/?hash=5c9995cb43cf3073dcaf0fd3068564c9&profile-id=3e604a99-5577-4a68-b147-8522488c1e04&reference-date=2023-01-02&source=job_alert_email_direct&pid=1&utm_source=automail-job-push&utm_medium=email&utm_campaign=wb%3Ajobs%7Ctg%3Ab2c%7Ccn%3Aww%7Clg%3Ade%7Cmg%3Ajob-views%7Cpd%3An%7Ccd%3Ajob-alert&utm_content=job-link&uid=6c67e29e-faf0-4238-ad1d-0e2a6c78868c&mid=58a95130-2750-462f-9375-b6edc0abfac2',
   'https://www.jobs.ch/de/stellenangebote/detail/9f6dad5c-bcaa-498f-b3c9-78e9fe2afdc6/?hash=5c9995cb43cf3073dcaf0fd3068564c9&profile-id=3e604a99-5577-4a68-b147-8522488c1e04&reference-date=2023-01-02&source=job_alert_email_direct&pid=1&utm_source=automail-job-push&utm_medium=email&utm_campaign=wb%3Ajobs%7Ctg%3Ab2c%7Ccn%3Aww%7Clg%3Ade%7Cmg%3Ajob-views%7Cpd%3An%7Ccd%3Ajob-alert&utm_content=job-link&uid=6c67e29e-faf0-4238-ad1d-0e2a6c78868c&mid=58a95130-2750-462f-9375-b6edc0abfac2'
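
Given a dictionary shaped like the one above (date as key, three parallel lists as value), the rows of the second dataset can also be produced directly by zipping the three lists. The sketch below illustrates this with toy data; `flatten_jobs` and the example URLs are hypothetical, and the titles and companies are taken from the outputs in this notebook.

```python
def flatten_jobs(jobs_by_date):
    """Turn {date: [urls, titles, companies]} into one row (dict) per job."""
    rows = []
    for date, (urls, titles, companies) in jobs_by_date.items():
        # zip pairs the i-th URL with the i-th title and company
        # (it stops at the shortest list if the lengths differ)
        for url, title, company in zip(urls, titles, companies):
            rows.append({
                "Date": date,
                "url": url,
                "Job Title": title.strip(),          # also removes a trailing "\r"
                "Employer Company": company.strip(),
            })
    return rows


# Toy data with hypothetical URLs, shaped like dict_for_df2
toy = {
    "Wed, 4 Jan 2023 06:16:14 +0000": [
        ["https://example.com/a", "https://example.com/b"],
        ["Data Scientist\r", "Web3 Praktikum.\r"],
        ["KPMG, Zurich\r", "Digitec, Zürich\r"],
    ]
}
rows = flatten_jobs(toy)
```

Passing `rows` to `pd.DataFrame(rows)` would then give the long-format table without the intermediate per-column DataFrames and merges.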

### "Special" For-Loop

In order to create the dataset with the URL-infos, I needed a special for-loop that **iterates only over every 3rd element, starting from a different offset for each column**.

As you can see below, this can be achieved via **extended slicing**, `list[start:stop:step]`: in `[2::3]`, "2" is the start index, the stop is left empty (= go to the end of the list) and "3" is the step 😜

In [79]:
list_of_df[2::3] # we iterate over every 3rd list (and start at the 3rd element // "2" element)
# check: are those the companies? --> yes!

[                       Wed, 4 Jan 2023 06:16:14 +0000
 0               Migros-Genossenschafts-Bund, Zürich\r
 1                                      KPMG, Zurich\r
 2   Global IT-GITR, Deutschschweiz, Stadt Zürich /...
 3                ProPharma Systems AG, Wettingen AG\r
 4                                ETH Zürich, Zurich\r
 5                               Sensirion AG, Stäfa\r
 6                      Sensirion AG, Stäfa, Schweiz\r
 7                           Concordia, Schaffhausen\r
 8                          Sparrow Ventures, Zürich\r
 9                              Biotronik AG, Bülach\r
 10                             Biotronik AG, Bülach\r
 11   Kantonale Verwaltung Zürich, Zürich Altstetten\r
 12              Freestar-Informatik AG, Raum Zürich\r
 13            Flughafen Zürich AG, Zürich-Flughafen\r
 14                   Freestar-Informatik AG, Zürich\r
 15            Flughafen Zürich AG, Zürich-Flughafen\r
 16                Universitätsspital Zürich, Zürich\r
 17       
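
The same slicing can be tried on a plain list. In the sketch below, `items` is a hypothetical stand-in for `list_of_df`, with three consecutive entries (URLs, titles, companies) per email.

```python
# Hypothetical stand-in for list_of_df: three entries per email
items = ["urls_1", "titles_1", "companies_1",
         "urls_2", "titles_2", "companies_2"]

urls = items[0::3]       # start at index 0, then take every 3rd element
titles = items[1::3]     # start at index 1, then take every 3rd element
companies = items[2::3]  # start at index 2, then take every 3rd element
```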

### What do the separate DFs look like?

To generate the DF with the URLs, I needed to merge some separate DFs together. This is what these separate DFs looked like...

In [113]:
url_df

| | Date | url |
|---|---|---|
| 0 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... |
| 1 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... |
| 2 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... |
| 3 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... |
| 4 | 2023-01-04 06:16:14+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... |
| ... | ... | ... |
| 179 | 2022-12-28 06:10:27+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... |
| 180 | 2022-12-28 06:10:27+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... |
| 181 | 2022-12-28 06:10:27+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... |
| 182 | 2022-12-28 06:10:27+00:00 | https://www.jobs.ch/de/stellenangebote/detail/... |


In [114]:
title_df

| | Date | Job Title |
|---|---|---|
| 0 | 2023-01-04 06:16:14+00:00 | Data Scientist - Optimierung Filialperformance... |
| 1 | 2023-01-04 06:16:14+00:00 | Senior Manager – Lighthouse Data Science\r |
| 2 | 2023-01-04 06:16:14+00:00 | Scientist in Cell Line Development\r |
| 3 | 2023-01-04 06:16:14+00:00 | Pharmabranche: Einstieg in die Software-Entwic... |
| 4 | 2023-01-04 06:16:14+00:00 | Postdoctoral researcher in management and cont... |
| ... | ... | ... |
| 179 | 2022-12-28 06:10:27+00:00 | Grafik Designer mit Coding und Motion Skills\r |
| 180 | 2022-12-28 06:10:27+00:00 | Web3 Praktikum.\r |
| 181 | 2022-12-28 06:10:27+00:00 | Praktikant Supply Management (m/w)\r |
| 182 | 2022-12-28 06:10:27+00:00 | Praktikum Praxisintegriertes Bachelor-Studium ... |


In [115]:
company_df

| | Date | Employer Company |
|---|---|---|
| 0 | 2023-01-04 06:16:14+00:00 | Migros-Genossenschafts-Bund, Zürich\r |
| 1 | 2023-01-04 06:16:14+00:00 | KPMG, Zurich\r |
| 2 | 2023-01-04 06:16:14+00:00 | Global IT-GITR, Deutschschweiz, Stadt Zürich /... |
| 3 | 2023-01-04 06:16:14+00:00 | ProPharma Systems AG, Wettingen AG\r |
| 4 | 2023-01-04 06:16:14+00:00 | ETH Zürich, Zurich\r |
| ... | ... | ... |
| 179 | 2022-12-28 06:10:27+00:00 | Digitec, Zürich\r |
| 180 | 2022-12-28 06:10:27+00:00 | Energy Schweiz AG, Zürich\r |
| 181 | 2022-12-28 06:10:27+00:00 | Sodexo (Suisse) SA, Glattbrugg\r |
| 182 | 2022-12-28 06:10:27+00:00 | Swiss Life AG, Zürich\r |


