
We are going to build a web crawler using Beautiful Soup, a Python library for parsing HTML.

In [2]:
from bs4 import BeautifulSoup
import requests

In [3]:
import pandas as pd

In [4]:
import time

First, let's scrape the first results page of the WikiCFP search for big data.

In [5]:
re = requests.get('http://www.wikicfp.com/cfp/call?conference=big%20data%20&page=1').text

In [6]:
soup = BeautifulSoup(re,'lxml')

Print the HTML that was scraped.

In [7]:
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
  <title>
   Big Data Call For Papers for Conferences, Workshops and Journals at WikiCFP
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Big Data Calls For Papers (CFP) for international conferences, workshops, meetings, seminars, events, journals and book chapters" name="description"/>
  <meta content="INDEX,NOFOLLOW" name="ROBOTS"/>
  <link href="/cfp/styles/wikicfp.css?v=2" rel="stylesheet" type="text/css"/>
  <link href="/cfp/images/wikicfp.ico" rel="shortcut icon"/>
  <script type="text/javascript">
   var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-2351831-1']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script');
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    ga.setAttribute('async', 'true'

Identify the conference table among all the table tags in the scraped HTML.

In [8]:
tables = soup.find_all('table')
conference_table = tables[5]
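
Indexing `tables[5]` only works while the page layout stays the same. As a hedged alternative, the sketch below selects the table by its heading row instead. The helper name `find_conference_table` and the sample HTML are illustrative assumptions, not part of the notebook, and the built-in `html.parser` is used only so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for a WikiCFP results page.
SAMPLE_HTML = """
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td> Event </td><td> When </td><td> Where </td><td> Deadline</td></tr>
  <tr><td><a href="#">ABC 2022</a></td><td>Example Conference A</td></tr>
  <tr><td>Jan 1, 2022 - Jan 2, 2022</td><td>City A, Country A</td><td>Dec 1, 2021</td></tr>
</table>
"""

def find_conference_table(soup):
    """Return the first table whose first row has the expected headings, else None."""
    expected = ['Event', 'When', 'Where', 'Deadline']
    for table in soup.find_all('table'):
        first_row = table.find('tr')
        if first_row is None:
            continue
        headings = [td.get_text(strip=True) for td in first_row.find_all('td')]
        if headings[:4] == expected:
            return table
    return None

sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_table = find_conference_table(sample_soup)
print(len(sample_table.find_all('tr')))
```

Checking the headings makes a layout change fail loudly (the function returns `None`) instead of silently parsing the wrong table.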

In [9]:
conference_table

<table align="center" cellpadding="3" cellspacing="1" width="100%">
<tr align="center" bgcolor="#bbbbbb"><td> Event </td><td> When </td><td> Where </td><td> Deadline</td><td><input name="checkall" onclick="CheckAll()" type="checkbox"/></td></tr><tr bgcolor="#f6f6f6">
<td align="left" rowspan="2"><a href="/cfp/servlet/event.showcfp?eventid=143061&amp;copyownerid=13881">IEEE ICCCBDA--Scopus and Ei Compendex 2022</a></td>
<td align="left" colspan="3">2022 IEEE the 7th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA 2022)--Scopus, EI Compendex</td><td align="center" rowspan="2"><input name="eventid_143061" type="checkbox"/></td>
</tr>
<tr bgcolor="#f6f6f6">
<td align="left">Apr 22, 2022 - Apr 24, 2022</td>
<td align="left">Chengdu, China</td>
<td align="left">Mar 15, 2022</td>
</tr>
<tr bgcolor="#e6e6e6">
<td align="left" rowspan="2"><a href="/cfp/servlet/event.showcfp?eventid=151211&amp;copyownerid=1">IEEE COINS 2022</a></td>
<td align="left" colspan="3">Intern

Create a list of the conference abbreviations that appear on this page.

In [10]:
conf_abb_list = [p.text for p in conference_table.find_all('a')]

In [11]:
conf_abb_list

['IEEE ICCCBDA--Scopus and Ei Compendex 2022',
 'IEEE COINS 2022',
 'ECCS--Ei Compendex, Scopus 2022',
 'BDET--Springer, Ei and Scopus 2022',
 'ICBDR--Ei, Scopus 2022',
 'ITIOT--Ei Compendex, Scopus 2022',
 'IEEE--ICAIBD--Ei and Scopus 2022',
 'CTCCC--Scopus and Ei Compendex 2022',
 'BigTMS&AI@ICCCI  2022',
 'ICISS--Ei Compendex and Scopus 2022',
 'BlockSys--Ei Compendex, SCI, ESCI 2022',
 'IRS-Frontiers 2022',
 'IJASUC 2022',
 'ICDIPV 2022',
 'NMCO 2022',
 'IJDMS 2022',
 'IJASA 2022',
 'ACIJ 2022',
 'NLPD 2022',
 'DMA 2022']

To extract the conference name and location, we iterate through the rows. We can skip the first row, as it only holds the column headings.

In [12]:
tr = conference_table.find_all('tr')
tr = tr[1:]
tr


[<tr bgcolor="#f6f6f6">
 <td align="left" rowspan="2"><a href="/cfp/servlet/event.showcfp?eventid=143061&amp;copyownerid=13881">IEEE ICCCBDA--Scopus and Ei Compendex 2022</a></td>
 <td align="left" colspan="3">2022 IEEE the 7th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA 2022)--Scopus, EI Compendex</td><td align="center" rowspan="2"><input name="eventid_143061" type="checkbox"/></td>
 </tr>, <tr bgcolor="#f6f6f6">
 <td align="left">Apr 22, 2022 - Apr 24, 2022</td>
 <td align="left">Chengdu, China</td>
 <td align="left">Mar 15, 2022</td>
 </tr>, <tr bgcolor="#e6e6e6">
 <td align="left" rowspan="2"><a href="/cfp/servlet/event.showcfp?eventid=151211&amp;copyownerid=1">IEEE COINS 2022</a></td>
 <td align="left" colspan="3">International Conference on Omni-Layer Intelligent Systems Internet of Things IoT | Artificial Intelligence | Machine Learning | Big Data | Blockchain | Edge &amp; Cloud Computing | Se</td><td align="center" rowspan="2"><input name="eventi

Check the number of rows.

In [13]:
len(tr)

40
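
Each conference occupies two consecutive rows (the name row, then the date, location and deadline row), so 40 rows correspond to 20 conferences. A minimal sketch of that pairing, using plain strings in place of the scraped rows; `pair_rows` is a hypothetical helper, not part of the notebook:

```python
def pair_rows(rows):
    """Group a flat list of rows into (name_row, detail_row) pairs."""
    if len(rows) % 2 != 0:
        raise ValueError('expected an even number of rows')
    # rows[0::2] are the name rows, rows[1::2] the matching detail rows.
    return list(zip(rows[0::2], rows[1::2]))

sample_rows = ['name A', 'details A', 'name B', 'details B']
pairs = pair_rows(sample_rows)
print(pairs)
```

Pairing with `zip` keeps each name next to its own details, whereas two separate lists rely on both staying the same length.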

In [14]:
# Collect the <td> cells of every row
td = [row.find_all('td') for row in tr]
td

[[<td align="left" rowspan="2"><a href="/cfp/servlet/event.showcfp?eventid=143061&amp;copyownerid=13881">IEEE ICCCBDA--Scopus and Ei Compendex 2022</a></td>,
  <td align="left" colspan="3">2022 IEEE the 7th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA 2022)--Scopus, EI Compendex</td>,
  <td align="center" rowspan="2"><input name="eventid_143061" type="checkbox"/></td>],
 [<td align="left">Apr 22, 2022 - Apr 24, 2022</td>,
  <td align="left">Chengdu, China</td>,
  <td align="left">Mar 15, 2022</td>],
 [<td align="left" rowspan="2"><a href="/cfp/servlet/event.showcfp?eventid=151211&amp;copyownerid=1">IEEE COINS 2022</a></td>,
  <td align="left" colspan="3">International Conference on Omni-Layer Intelligent Systems Internet of Things IoT | Artificial Intelligence | Machine Learning | Big Data | Blockchain | Edge &amp; Cloud Computing | Se</td>,
  <td align="center" rowspan="2"><input name="eventid_151211" type="checkbox"/></td>],
 [<td align="left">Aug 1, 20

In [15]:
conf_name_scrape = []
conf_loc_scrape = []
for index, item in enumerate(td):
  if index%2 == 0:
    conf_name_scrape.append(item)
  else:
    conf_loc_scrape.append(item)

conf_name_scrape

[[<td align="left" rowspan="2"><a href="/cfp/servlet/event.showcfp?eventid=143061&amp;copyownerid=13881">IEEE ICCCBDA--Scopus and Ei Compendex 2022</a></td>,
  <td align="left" colspan="3">2022 IEEE the 7th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA 2022)--Scopus, EI Compendex</td>,
  <td align="center" rowspan="2"><input name="eventid_143061" type="checkbox"/></td>],
 [<td align="left" rowspan="2"><a href="/cfp/servlet/event.showcfp?eventid=151211&amp;copyownerid=1">IEEE COINS 2022</a></td>,
  <td align="left" colspan="3">International Conference on Omni-Layer Intelligent Systems Internet of Things IoT | Artificial Intelligence | Machine Learning | Big Data | Blockchain | Edge &amp; Cloud Computing | Se</td>,
  <td align="center" rowspan="2"><input name="eventid_151211" type="checkbox"/></td>],
 [<td align="left" rowspan="2"><a href="/cfp/servlet/event.showcfp?eventid=147658&amp;copyownerid=13881">ECCS--Ei Compendex, Scopus 2022</a></td>,
  <td align="

In [16]:
conf_loc_scrape

[[<td align="left">Apr 22, 2022 - Apr 24, 2022</td>,
  <td align="left">Chengdu, China</td>,
  <td align="left">Mar 15, 2022</td>],
 [<td align="left">Aug 1, 2022 - Aug 3, 2022</td>,
  <td align="left">Barcelona, Spain</td>,
  <td align="left">Mar 30, 2022</td>],
 [<td align="left">May 12, 2022 - May 14, 2022</td>,
  <td align="left">Vienna, Austria</td>,
  <td align="left">Mar 15, 2022</td>],
 [<td align="left">Apr 22, 2022 - Apr 24, 2022</td>,
  <td align="left">Singapore</td>,
  <td align="left">Mar 15, 2022</td>],
 [<td align="left">Aug 10, 2022 - Aug 12, 2022</td>,
  <td align="left">Harbin, China</td>,
  <td align="left">Mar 15, 2022</td>],
 [<td align="left">Apr 22, 2022 - Apr 25, 2022</td>,
  <td align="left">Wuhan, China</td>,
  <td align="left">Mar 15, 2022</td>],
 [<td align="left">May 27, 2022 - May 30, 2022</td>,
  <td align="left">Chengdu, China</td>,
  <td align="left">Mar 15, 2022</td>],
 [<td align="left">Apr 15, 2022 - Apr 17, 2022</td>,
  <td align="left">Beijing, Ch

Create a list to store the names of all the conferences.

In [17]:
conf_name_list =[]
for li in conf_name_scrape:
  conf_name_list.append(li[1].text)
conf_name_list

['2022 IEEE the 7th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA 2022)--Scopus, EI Compendex',
 'International Conference on Omni-Layer Intelligent Systems Internet of Things IoT | Artificial Intelligence | Machine Learning | Big Data | Blockchain | Edge & Cloud Computing | Se',
 '2022 2nd European Conference on Communication Systems (ECCS 2022)--Ei Compendex, Scopus',
 'Springer--2022 4th International Conference on Big Data Engineering and Technology (BDET 2022)--Scopus, Ei Compendex',
 '2022 The 6th International Conference on Big Data Research (ICBDR 2022)--Ei Compendex, Scopus',
 '2022 The 3rd International Conference on Information Technology and Internet of Things (ITIOT 2022)--EI Compendex, Scopus',
 'IEEE--2022 The 5th International Conference on Artificial Intelligence and Big Data (ICAIBD 2022)--EI Compendex, Scopus',
 '2022 The 3rd Communication Technologies and Cloud Computing Conference (CTCCC 2022)--EI Compendex, Scopus',
 'Special Session 

Create a list to store the locations of all the conferences.

In [18]:
conf_loc_list =[]
for li in conf_loc_scrape:
  conf_loc_list.append(li[1].text)
conf_loc_list

['Chengdu, China',
 'Barcelona, Spain',
 'Vienna, Austria',
 'Singapore',
 'Harbin, China',
 'Wuhan, China',
 'Chengdu, China',
 'Beijing, China',
 'TUNISIA',
 'Beijing, China',
 'Chengdu, China',
 'N/A',
 'N/A',
 'Toronto, Canada',
 'Sydney, Australia',
 'N/A',
 'N/A',
 'N/A',
 'Copenhagen, Denmark',
 'Vancouver, Canada']

In [19]:
len(conf_loc_list)

20

In [20]:
len(conf_name_list)

20

In [21]:
len(conf_abb_list)

20

We have collected information on 20 big data conferences from page 1.

We now create a dictionary that maps each column name to its list, and build a dataframe from it.

In [22]:
d = {'Conference_Acronym':conf_abb_list,'Conference_Name':conf_name_list,'Conference_Location':conf_loc_list}
df = pd.DataFrame(d,columns=['Conference_Acronym','Conference_Name','Conference_Location'])
df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location
0,IEEE ICCCBDA--Scopus and Ei Compendex 2022,2022 IEEE the 7th International Conference on ...,"Chengdu, China"
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain"
2,"ECCS--Ei Compendex, Scopus 2022",2022 2nd European Conference on Communication ...,"Vienna, Austria"
3,"BDET--Springer, Ei and Scopus 2022",Springer--2022 4th International Conference on...,Singapore
4,"ICBDR--Ei, Scopus 2022",2022 The 6th International Conference on Big D...,"Harbin, China"
5,"ITIOT--Ei Compendex, Scopus 2022",2022 The 3rd International Conference on Infor...,"Wuhan, China"
6,IEEE--ICAIBD--Ei and Scopus 2022,IEEE--2022 The 5th International Conference on...,"Chengdu, China"
7,CTCCC--Scopus and Ei Compendex 2022,2022 The 3rd Communication Technologies and Cl...,"Beijing, China"
8,BigTMS&AI@ICCCI 2022,Special Session on Big Text Mining Searching &...,TUNISIA
9,ICISS--Ei Compendex and Scopus 2022,2022 The 5th International Conference on Infor...,"Beijing, China"


In [23]:
df_list = []
df_list.append(df)

Now that we have collected the information from the first page, we can apply the same logic to the remaining pages. The results from the first page are added to a list of dataframes.

We are going to scrape the HTML of pages 2 to 20. A 5-second pause follows every request so that we do not overload the website, get blocked by it, or make our crawl resemble a denial-of-service (DoS) attack.
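
The delayed crawl can be sketched as a small reusable function. This is only an illustration under stated assumptions: `crawl_pages` and `fake_fetch` are hypothetical names, and the fetcher is passed in as a parameter so the sketch runs without network access.

```python
import time

def crawl_pages(fetch, base_url, pages, delay_seconds=5.0):
    """Fetch each page in turn, sleeping between consecutive requests."""
    results = []
    for position, page in enumerate(pages):
        if position > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(base_url + str(page)))
    return results

# Stand-in fetcher (an assumption for this demo).
# With requests it could be: lambda url: requests.get(url, timeout=30).text
def fake_fetch(url):
    return '<html>' + url + '</html>'

demo = crawl_pages(fake_fetch, 'http://example.com/list?page=', range(2, 5), delay_seconds=0)
print(len(demo))
```

Passing the fetcher in keeps the pacing logic separate from the HTTP call, so the loop can be tried with a zero delay and fake pages before it touches the real site.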

In [24]:
url = 'http://www.wikicfp.com/cfp/call?conference=big%20data%20&page='
pages_li = []
for page in (range(2,21)):
  re = requests.get(url+str(page)).text
  time.sleep(5) # To introduce a delay of 5 seconds between every request
  soup = BeautifulSoup(re,'lxml')
  pages_li.append(soup)
pages_li


[<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
 <html>
 <head>
 <title>Big Data Call For Papers for Conferences, Workshops and Journals at WikiCFP</title>
 <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
 <meta content="Big Data Calls For Papers (CFP) for international conferences, workshops, meetings, seminars, events, journals and book chapters" name="description"/>
 <meta content="INDEX,NOFOLLOW" name="ROBOTS"/>
 <link href="/cfp/styles/wikicfp.css?v=2" rel="stylesheet" type="text/css"/>
 <link href="/cfp/images/wikicfp.ico" rel="shortcut icon"/>
 <script type="text/javascript">
   var _gaq = _gaq || [];
   _gaq.push(['_setAccount', 'UA-2351831-1']);
   _gaq.push(['_trackPageview']);
   (function() {
     var ga = document.createElement('script');
     ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
     ga.setAttribute('async', 'true');
   

In [25]:
print(pages_li[18].prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
  <title>
   Big Data Call For Papers for Conferences, Workshops and Journals at WikiCFP
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Big Data Calls For Papers (CFP) for international conferences, workshops, meetings, seminars, events, journals and book chapters" name="description"/>
  <meta content="INDEX,NOFOLLOW" name="ROBOTS"/>
  <link href="/cfp/styles/wikicfp.css?v=2" rel="stylesheet" type="text/css"/>
  <link href="/cfp/images/wikicfp.ico" rel="shortcut icon"/>
  <script type="text/javascript">
   var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-2351831-1']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script');
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    ga.setAttribute('async', 'true'

In [26]:
print(pages_li[0].prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
 <head>
  <title>
   Big Data Call For Papers for Conferences, Workshops and Journals at WikiCFP
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Big Data Calls For Papers (CFP) for international conferences, workshops, meetings, seminars, events, journals and book chapters" name="description"/>
  <meta content="INDEX,NOFOLLOW" name="ROBOTS"/>
  <link href="/cfp/styles/wikicfp.css?v=2" rel="stylesheet" type="text/css"/>
  <link href="/cfp/images/wikicfp.ico" rel="shortcut icon"/>
  <script type="text/javascript">
   var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-2351831-1']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script');
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    ga.setAttribute('async', 'true'

Now apply the extraction logic used for the first page to each of the remaining pages.
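
Because the same extraction steps are repeated for every page (and again for other topics), they could be wrapped in one function. The sketch below is an assumption-laden variant, not the notebook's code: `parse_conference_table` is a hypothetical helper, the sample data is made up, it reads the acronym from the first cell of each name row, and it does not reproduce the `Expired` filter used in the cell that follows.

```python
from bs4 import BeautifulSoup

def parse_conference_table(table):
    """Return a list of (acronym, name, location) tuples from a results table."""
    rows = table.find_all('tr')[1:]  # skip the heading row
    records = []
    # Rows come in pairs: the name row, then the date/location/deadline row.
    for name_row, detail_row in zip(rows[0::2], rows[1::2]):
        name_cells = name_row.find_all('td')
        detail_cells = detail_row.find_all('td')
        link = name_cells[0].find('a')
        acronym = link.get_text(strip=True) if link else ''
        records.append((acronym,
                        name_cells[1].get_text(strip=True),
                        detail_cells[1].get_text(strip=True)))
    return records

# Made-up sample table that mimics the two-rows-per-conference layout.
SAMPLE = """
<table>
<tr><td>Event</td><td>When</td><td>Where</td><td>Deadline</td></tr>
<tr><td rowspan="2"><a href="#">ABC 2022</a></td><td colspan="3">Example Conference A</td></tr>
<tr><td>Jan 1, 2022 - Jan 2, 2022</td><td>City A, Country A</td><td>Dec 1, 2021</td></tr>
<tr><td rowspan="2"><a href="#">DEMO 2022</a></td><td colspan="3">Example Conference B</td></tr>
<tr><td>N/A</td><td>N/A</td><td>Dec 15, 2021</td></tr>
</table>
"""

records = parse_conference_table(BeautifulSoup(SAMPLE, 'html.parser').find('table'))
print(records)
```

Building one record per row pair keeps the acronym, name and location of a conference together, so the three columns cannot drift out of step.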

In [27]:
for i in range(len(pages_li)):
  tables = pages_li[i].find_all('table')
  conference_table = tables[5]
  # print(conference_table)
  conf_abb_list = [p.text for p in conference_table.find_all('a')]
  print(conf_abb_list)
  tr = conference_table.find_all('tr')
  tr = tr[1:]

  td = []
  for row in range(len(tr)):
    if("Expired" not in tr[row].text):
      td.append(tr[row].find_all('td'))

  conf_name_scrape = []
  conf_loc_scrape = []
  for index, item in enumerate(td):
    if index%2 == 0:
      conf_name_scrape.append(item)
    else:
      conf_loc_scrape.append(item)

  conf_name_list =[]
  for li in conf_name_scrape:
    conf_name_list.append(li[1].text)
  # print(conf_name_list)
  conf_loc_list =[]
  for li in conf_loc_scrape:
    conf_loc_list.append(li[1].text)
  
  d = {'Conference_Acronym':conf_abb_list,'Conference_Name':conf_name_list,'Conference_Location':conf_loc_list}
  df_add = pd.DataFrame(d,columns=['Conference_Acronym','Conference_Name','Conference_Location'])

  df_list.append(df_add)




['BIGML  2022', 'IJITCS 2022', 'IJCSA 2022', 'BDBS 2022', 'CYBI 2022', 'IJCAx 2022', 'IJGTT 2022', 'CBIoT  2022', 'DaKM 2022', 'JARES 2022', 'IJCCSA 2022', 'IJDKP 2022', 'IJPLA  2022', 'ELEN 2022', 'CBDA 2022', 'IJCSES  2022', 'NLPCL 2022', 'ACII 2022', 'IJIST 2022', 'NLPA 2022']
['SoCAV 2022', 'IEEE--DSIT 2022', 'ICINT--EI Compendex, Scopus 2022', 'BDEE--Ei Compendex, Scopus 2022', 'IEEE-Ei/Scopus-CWCBD 2022', 'ICAIT--Ei, Scopus 2022', 'EI/Scopus--DMBDA 2022', 'SEBD 2022', 'ACCSE 2022', 'APWeb-WAIM 2022', 'APWeb-WAIM 2022', 'APWeb-WAIM 2022', 'IMMM 2022', 'SIU - Special Session on CSS 2022', 'ICTC--IEEE, Scopus, Ei 2022', 'ICISDM--ACM, Ei and Scopus 2022', 'BDAP 2022', 'AIAD 2022', 'E&C 2022', 'MLNLP 2022']
['AIFZ 2022', 'NLAI 2022', 'IJSCAI 2022', 'DSML 2022', 'DMDBS 2022', 'MLDS 2022', 'NWCOM 2022', 'IOTCB 2022', 'ICCSEA 2022', 'IJCSITY 2022', 'NATL 2022', 'JCC-BD&ET 2022', 'IDITR-IEEE 2022', 'IEEE COINS  2022', 'ACM--ICBDC--EI Compendex, Scopus 2022', 'ENHANCE 2022', 'EAI IoTCare  

In [28]:
len(df_list) # Remember to reset df_list to [] and start from the first page before re-running the cells above

20

Concatenate all the collected dataframes into a single dataframe.
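
When dataframes are concatenated, each one keeps its own 0-based index by default, which is why the combined frame needs its index reset afterwards. As a small sketch with made-up data, passing `ignore_index=True` to `pd.concat` renumbers the rows in one step:

```python
import pandas as pd

page_1 = pd.DataFrame({'Conference_Acronym': ['AAA 2022', 'BBB 2022']})
page_2 = pd.DataFrame({'Conference_Acronym': ['CCC 2022']})

# Without ignore_index the result would carry the index values 0, 1, 0;
# with ignore_index=True the rows are renumbered 0..n-1.
combined = pd.concat([page_1, page_2], ignore_index=True)
print(combined.index.tolist())
```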

In [29]:
final_df = pd.concat(df_list)


In [31]:
len(final_df)

400

In [32]:
final_df_without_index = final_df.reset_index()

In [33]:
final_df_without_index

Unnamed: 0,index,Conference_Acronym,Conference_Name,Conference_Location
0,0,IEEE ICCCBDA--Scopus and Ei Compendex 2022,2022 IEEE the 7th International Conference on ...,"Chengdu, China"
1,1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain"
2,2,"ECCS--Ei Compendex, Scopus 2022",2022 2nd European Conference on Communication ...,"Vienna, Austria"
3,3,"BDET--Springer, Ei and Scopus 2022",Springer--2022 4th International Conference on...,Singapore
4,4,"ICBDR--Ei, Scopus 2022",2022 The 6th International Conference on Big D...,"Harbin, China"
...,...,...,...,...
395,15,Cognitive Computing 2020,Cognitive Computing with Big Data System over ...,
396,16,HCIS 2020,14th International Conference on Human-Centere...,"Split, Croatia"
397,17,TempWeb 2020,The 10th Temporal Web Analytics Workshop 2020 ...,"Taipei, Taiwan"
398,18,IEEE ICALT 2020,20th IEEE International Conference on Advanced...,"Tartu, Estonia"


In [34]:
final_df_without_index.drop(['index'],axis=1,inplace=True)

In [35]:
final_df_without_index

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location
0,IEEE ICCCBDA--Scopus and Ei Compendex 2022,2022 IEEE the 7th International Conference on ...,"Chengdu, China"
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain"
2,"ECCS--Ei Compendex, Scopus 2022",2022 2nd European Conference on Communication ...,"Vienna, Austria"
3,"BDET--Springer, Ei and Scopus 2022",Springer--2022 4th International Conference on...,Singapore
4,"ICBDR--Ei, Scopus 2022",2022 The 6th International Conference on Big D...,"Harbin, China"
...,...,...,...
395,Cognitive Computing 2020,Cognitive Computing with Big Data System over ...,
396,HCIS 2020,14th International Conference on Human-Centere...,"Split, Croatia"
397,TempWeb 2020,The 10th Temporal Web Analytics Workshop 2020 ...,"Taipei, Taiwan"
398,IEEE ICALT 2020,20th IEEE International Conference on Advanced...,"Tartu, Estonia"


bigdata_df is the final dataframe, with 400 rows of conference information.

In [36]:
bigdata_df = final_df_without_index

In [37]:
bigdata_df.head()

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location
0,IEEE ICCCBDA--Scopus and Ei Compendex 2022,2022 IEEE the 7th International Conference on ...,"Chengdu, China"
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain"
2,"ECCS--Ei Compendex, Scopus 2022",2022 2nd European Conference on Communication ...,"Vienna, Austria"
3,"BDET--Springer, Ei and Scopus 2022",Springer--2022 4th International Conference on...,Singapore
4,"ICBDR--Ei, Scopus 2022",2022 The 6th International Conference on Big D...,"Harbin, China"


Create another column, Conference_year, by extracting the year from the acronym column.
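
Taking the last whitespace-separated token works as long as every acronym ends in a year. As a hedged alternative (not what the notebook does), a regular expression can look for a four-digit year anywhere in the text; `extract_year` and `YEAR_PATTERN` are hypothetical names:

```python
import re

# Four-digit years starting with 19 or 20, delimited by word boundaries.
YEAR_PATTERN = re.compile(r'\b(?:19|20)\d{2}\b')

def extract_year(acronym):
    """Return the last four-digit year found in the text, or None."""
    matches = YEAR_PATTERN.findall(acronym)
    return matches[-1] if matches else None

print(extract_year('ICCCI 2022 - ML-SDA 2022'))
```

Returning `None` when no year is present makes the problem rows easy to find, instead of storing an unrelated last word as the year.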

In [38]:

bigdata_df['Conference_Acronym'] = bigdata_df['Conference_Acronym'].str.strip()

In [39]:
bigdata_df['Conference_year'] = bigdata_df['Conference_Acronym'].str.split().str[-1]

In [40]:
bigdata_df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location,Conference_year
0,IEEE ICCCBDA--Scopus and Ei Compendex 2022,2022 IEEE the 7th International Conference on ...,"Chengdu, China",2022
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain",2022
2,"ECCS--Ei Compendex, Scopus 2022",2022 2nd European Conference on Communication ...,"Vienna, Austria",2022
3,"BDET--Springer, Ei and Scopus 2022",Springer--2022 4th International Conference on...,Singapore,2022
4,"ICBDR--Ei, Scopus 2022",2022 The 6th International Conference on Big D...,"Harbin, China",2022
...,...,...,...,...
395,Cognitive Computing 2020,Cognitive Computing with Big Data System over ...,,2020
396,HCIS 2020,14th International Conference on Human-Centere...,"Split, Croatia",2020
397,TempWeb 2020,The 10th Temporal Web Analytics Workshop 2020 ...,"Taipei, Taiwan",2020
398,IEEE ICALT 2020,20th IEEE International Conference on Advanced...,"Tartu, Estonia",2020


Manually correct an anomalous year value.
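
One way to locate such anomalies, sketched here on made-up data, is to flag every row whose extracted value is not exactly four digits. The names `demo_df` and `suspect_rows` are illustrative only:

```python
import pandas as pd

demo_df = pd.DataFrame({'Conference_year': ['2022', 'Workshop', '2021']})

# Rows whose extracted value is not exactly four digits need manual review.
is_year = demo_df['Conference_year'].astype(str).str.match(r'^\d{4}$')
suspect_rows = demo_df.index[~is_year].tolist()
print(suspect_rows)
```

The resulting index labels can then be passed to `.at[...]` for a manual fix, as in the cell that follows.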

In [41]:
bigdata_df.at[156,'Conference_year']=2021

In [42]:
bigdata_df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location,Conference_year
0,IEEE ICCCBDA--Scopus and Ei Compendex 2022,2022 IEEE the 7th International Conference on ...,"Chengdu, China",2022
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain",2022
2,"ECCS--Ei Compendex, Scopus 2022",2022 2nd European Conference on Communication ...,"Vienna, Austria",2022
3,"BDET--Springer, Ei and Scopus 2022",Springer--2022 4th International Conference on...,Singapore,2022
4,"ICBDR--Ei, Scopus 2022",2022 The 6th International Conference on Big D...,"Harbin, China",2022
...,...,...,...,...
395,Cognitive Computing 2020,Cognitive Computing with Big Data System over ...,,2020
396,HCIS 2020,14th International Conference on Human-Centere...,"Split, Croatia",2020
397,TempWeb 2020,The 10th Temporal Web Analytics Workshop 2020 ...,"Taipei, Taiwan",2020
398,IEEE ICALT 2020,20th IEEE International Conference on Advanced...,"Tartu, Estonia",2020


In [44]:
bigdata_df.head()

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location,Conference_year
0,IEEE ICCCBDA--Scopus and Ei Compendex 2022,2022 IEEE the 7th International Conference on ...,"Chengdu, China",2022
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain",2022
2,"ECCS--Ei Compendex, Scopus 2022",2022 2nd European Conference on Communication ...,"Vienna, Austria",2022
3,"BDET--Springer, Ei and Scopus 2022",Springer--2022 4th International Conference on...,Singapore,2022
4,"ICBDR--Ei, Scopus 2022",2022 The 6th International Conference on Big D...,"Harbin, China",2022


In [45]:
len(bigdata_df)

400

Dropping duplicate rows from the data.
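
Before dropping anything, it can help to count how many rows are exact duplicates. A minimal sketch on made-up data (the frame and variable names are assumptions):

```python
import pandas as pd

demo_df = pd.DataFrame({
    'Conference_Acronym': ['AAA 2022', 'AAA 2022', 'BBB 2022'],
    'Conference_Location': ['N/A', 'N/A', 'N/A'],
})

# duplicated() marks rows that are identical to an earlier row.
duplicate_count = int(demo_df.duplicated().sum())
deduped = demo_df.drop_duplicates().reset_index(drop=True)
print(duplicate_count, len(deduped))
```

Only rows that match in every column are removed, so two listings of the same conference with different locations would both be kept.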

In [46]:
bigdata_df.drop_duplicates(inplace=True)

In [47]:
len(bigdata_df)

393

Download the file as a CSV so the data can be cleaned in OpenRefine.

In [48]:
bigdata_df.to_csv('bigdata_df_final.csv',index=False)

In [50]:
from google.colab import files
files.download('bigdata_df_final.csv')


Let's do the same for the machine learning and artificial intelligence conferences.
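
Since only the search term changes between topics, the URLs can be built from the topic name rather than typed out by hand. A small sketch using the standard library; `build_search_url` and `BASE_URL` are hypothetical names:

```python
from urllib.parse import quote

BASE_URL = 'http://www.wikicfp.com/cfp/call'

def build_search_url(topic, page):
    """Build a WikiCFP search URL for a topic and a 1-based page number."""
    # quote() percent-encodes the topic, e.g. a space becomes %20.
    return BASE_URL + '?conference=' + quote(topic) + '&page=' + str(page)

print(build_search_url('machine learning', 1))
```

Note that the big data URL used earlier carries a trailing `%20`, which this helper would only reproduce if the topic string ended with a space.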

In [51]:
url_ml = 'http://www.wikicfp.com/cfp/call?conference=machine%20learning&page='
url_ai = 'http://www.wikicfp.com/cfp/call?conference=artificial%20intelligence&page='
pages_li_ml = []
pages_li_ai = []
for page in (range(1,21)):
  re_ml = requests.get(url_ml+str(page)).text
  re_ai = requests.get(url_ai+str(page)).text
  time.sleep(5) # Pause for 5 seconds after each pair of requests (one ML page, one AI page)
  soup_ml = BeautifulSoup(re_ml,'lxml')
  soup_ai = BeautifulSoup(re_ai,'lxml')
  pages_li_ml.append(soup_ml)
  pages_li_ai.append(soup_ai)

In [52]:
len(pages_li_ml)

20

In [53]:
len(pages_li_ai)

20

In [56]:
df_list = []
for i in range(len(pages_li_ml)):
  tables = pages_li_ml[i].find_all('table')
  conference_table = tables[5]
  # print(conference_table)
  conf_abb_list = [p.text for p in conference_table.find_all('a')]
  print(conf_abb_list)
  tr = conference_table.find_all('tr')
  tr = tr[1:]

  td = []
  for row in range(len(tr)):
    if("Expired" not in tr[row].text):
      td.append(tr[row].find_all('td'))

  conf_name_scrape = []
  conf_loc_scrape = []
  for index, item in enumerate(td):
    if index%2 == 0:
      conf_name_scrape.append(item)
    else:
      conf_loc_scrape.append(item)

  conf_name_list =[]
  for li in conf_name_scrape:
    conf_name_list.append(li[1].text)
  # print(conf_name_list)
  conf_loc_list =[]
  for li in conf_loc_scrape:
    conf_loc_list.append(li[1].text)
  
  d = {'Conference_Acronym':conf_abb_list,'Conference_Name':conf_name_list,'Conference_Location':conf_loc_list}
  df_add = pd.DataFrame(d,columns=['Conference_Acronym','Conference_Name','Conference_Location'])

  df_list.append(df_add)

['MDAI 2022', 'IEEE COINS 2022', 'ICCCI 2022 - ML-SDA 2022', 'Sensors-SI-ISHMA 2021', 'EI Compendex, Scopus-CDIVP 2022', 'ICIST--Ei, Scopus 2022', 'ICCCI 2022 - Innov-Healthcare 2022', 'AIBSAT  2022', 'ICIBM 2022', 'HPlan 2022', 'IJGTT 2022', 'IJMA 2022', 'MLCL 2022', 'AMA 2022', 'NATAP 2022', 'IJITCS 2022', 'IJIT 2022', 'BIOSE 2022', 'NLPML 2022', 'CSEIJ 2022']
['CAIML 2022', 'IJANS 2022', 'IJWesT 2022', 'CiVEJ 2022', 'ArIT 2022', 'ICAIT  2022', 'DMML 2022', 'IJCGA 2022', 'JARES 2022', 'EEIJ 2022', 'ARIN 2022', 'BIGML  2022', 'MSEJ 2022', 'SNLP 2022', 'NLPI 2022', 'IJCSES  2022', 'NLCA 2022', 'IJU 2022', 'CSEIT 2022', 'ACII 2022']
['IJMNCT 2022', 'CMIT  2022', 'IJCSEA 2022', 'SAIM 2022', 'ECTIJ 2022', 'NLPCL 2022', 'AIAPP 2022', 'CMIT 2022', 'ISPR 2022', 'IJITCA 2022', 'IJCI 2022', 'NLPD 2022', 'JEDT 2022', 'ICDIPV 2022', 'IJCSA 2022', 'GRAPH-HOC 2022', 'IJAIA 2022', 'SAIM 2022', 'JANT 2022', 'ACIJ 2022']
['CMLA 2022', 'NMCO 2022', 'IEEE-ISNCC 2022', 'CD-MAKE 2022', 'ACM--SPML--EI Com

In [57]:
len(df_list)

20

In [58]:
final_df_ml = pd.concat(df_list)


In [59]:
final_df_ml

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location
0,MDAI 2022,19th Modeling Decisions for Artificial Intelli...,"Sant Cugat, Barcelona"
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain"
2,ICCCI 2022 - ML-SDA 2022,Special Session on Machine Learning for Social...,Hammamet - Tunisia
3,Sensors-SI-ISHMA 2021,"[Sensors, IF 3.275] Special Issue on Intellige...",
4,"EI Compendex, Scopus-CDIVP 2022",2022 2nd International Conference on Digital I...,"Shanghai, China"
...,...,...,...
15,IDSTA 2021,The International Conference on Intelligent Da...,"Tartu, Estonia"
16,ICBK 2021,The 12th IEEE International Conference on Big ...,"Auckland, New Zealand"
17,ICMLA AML-IoT FLAME 2021,IEEE Int'l Conf. on Machine Learning and Appli...,"Pasadena, California"
18,RiTA 2021,9th International Conference on Robot Intellig...,KAIST Daejeon + Virtual Access


In [60]:
final_df_without_index_ml = final_df_ml.reset_index()

In [61]:
final_df_without_index_ml.drop(['index'],axis=1,inplace=True)

In [62]:
ml_df = final_df_without_index_ml

In [63]:
ml_df['Conference_Acronym'] = ml_df['Conference_Acronym'].str.strip()
ml_df['Conference_year'] = ml_df['Conference_Acronym'].str.split().str[-1]

In [64]:
ml_df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location,Conference_year
0,MDAI 2022,19th Modeling Decisions for Artificial Intelli...,"Sant Cugat, Barcelona",2022
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain",2022
2,ICCCI 2022 - ML-SDA 2022,Special Session on Machine Learning for Social...,Hammamet - Tunisia,2022
3,Sensors-SI-ISHMA 2021,"[Sensors, IF 3.275] Special Issue on Intellige...",,2021
4,"EI Compendex, Scopus-CDIVP 2022",2022 2nd International Conference on Digital I...,"Shanghai, China",2022
...,...,...,...,...
395,IDSTA 2021,The International Conference on Intelligent Da...,"Tartu, Estonia",2021
396,ICBK 2021,The 12th IEEE International Conference on Big ...,"Auckland, New Zealand",2021
397,ICMLA AML-IoT FLAME 2021,IEEE Int'l Conf. on Machine Learning and Appli...,"Pasadena, California",2021
398,RiTA 2021,9th International Conference on Robot Intellig...,KAIST Daejeon + Virtual Access,2021


In [65]:
ml_df.drop_duplicates(inplace=True)

In [66]:
len(ml_df)

391

In [67]:
ml_df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location,Conference_year
0,MDAI 2022,19th Modeling Decisions for Artificial Intelli...,"Sant Cugat, Barcelona",2022
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain",2022
2,ICCCI 2022 - ML-SDA 2022,Special Session on Machine Learning for Social...,Hammamet - Tunisia,2022
3,Sensors-SI-ISHMA 2021,"[Sensors, IF 3.275] Special Issue on Intellige...",,2021
4,"EI Compendex, Scopus-CDIVP 2022",2022 2nd International Conference on Digital I...,"Shanghai, China",2022
...,...,...,...,...
394,IEEE CIIoT 2021,IEEE Symposium on Computational Intelligence i...,"Orlando, Florida, USA",2021
395,IDSTA 2021,The International Conference on Intelligent Da...,"Tartu, Estonia",2021
396,ICBK 2021,The 12th IEEE International Conference on Big ...,"Auckland, New Zealand",2021
397,ICMLA AML-IoT FLAME 2021,IEEE Int'l Conf. on Machine Learning and Appli...,"Pasadena, California",2021


In [68]:
ml_df.at[128,'Conference_year']=2022
ml_df.at[384,'Conference_year']=2021
ml_df.at[345,'Conference_year']=2021


In [69]:
ml_df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location,Conference_year
0,MDAI 2022,19th Modeling Decisions for Artificial Intelli...,"Sant Cugat, Barcelona",2022
1,IEEE COINS 2022,International Conference on Omni-Layer Intelli...,"Barcelona, Spain",2022
2,ICCCI 2022 - ML-SDA 2022,Special Session on Machine Learning for Social...,Hammamet - Tunisia,2022
3,Sensors-SI-ISHMA 2021,"[Sensors, IF 3.275] Special Issue on Intellige...",,2021
4,"EI Compendex, Scopus-CDIVP 2022",2022 2nd International Conference on Digital I...,"Shanghai, China",2022
...,...,...,...,...
394,IEEE CIIoT 2021,IEEE Symposium on Computational Intelligence i...,"Orlando, Florida, USA",2021
395,IDSTA 2021,The International Conference on Intelligent Da...,"Tartu, Estonia",2021
396,ICBK 2021,The 12th IEEE International Conference on Big ...,"Auckland, New Zealand",2021
397,ICMLA AML-IoT FLAME 2021,IEEE Int'l Conf. on Machine Learning and Appli...,"Pasadena, California",2021


In [70]:
len(ml_df)

391

In [71]:
ml_df.to_csv('ml_db.csv',index=False)

In [72]:
files.download('ml_db.csv')


In [73]:
df_list = []
for page in pages_li_ai:
  # The conference listing is the sixth <table> on each scraped page.
  conference_table = page.find_all('table')[5]
  # Acronyms are the link texts inside the conference table.
  conf_abb_list = [a.text for a in conference_table.find_all('a')]
  print(conf_abb_list)

  # Skip the header row and drop any row marked "Expired".
  rows = conference_table.find_all('tr')[1:]
  td = [row.find_all('td') for row in rows if "Expired" not in row.text]

  # Each conference spans two rows: the second cell of the first row is read
  # as the name, and the second cell of the following row as the location.
  conf_name_list = [cells[1].text for cells in td[0::2]]
  conf_loc_list = [cells[1].text for cells in td[1::2]]

  d = {'Conference_Acronym':conf_abb_list,'Conference_Name':conf_name_list,'Conference_Location':conf_loc_list}
  df_add = pd.DataFrame(d,columns=['Conference_Acronym','Conference_Name','Conference_Location'])

  df_list.append(df_add)

['ICSI 2022', 'ISAI--Ei Compendex, Scopus 2022', 'IEEE--ICAIBD--Ei and Scopus 2022', 'WSEA--Ei Compendex, Scopus 2022', 'ICIST--Ei Compendex, Scopus 2022', 'CTCCC--Scopus and Ei Compendex 2022', 'ICVR--IEEE, Ei, Scopus 2022', 'SNTA 2022', 'KSEM 2022', 'ICMIMT--IEEE, Ei and Scopus 2022', 'ACM--ITCC--Ei Compendex and Scopus 2022', 'EI Compendex, Scopus-EECT 2022', 'ACM--ICCTA--Ei Compendex, Scopus 2022', 'SOCS 2022', 'IVUS 2022', 'DAPSPAC 2022', 'NLPD 2022', 'IJPS 2022', 'JANT 2022', 'CoSIT 2022']
['IJU 2022', 'ICDIPV 2022', 'ArIT 2022', 'AIAPP 2022', 'JEDT 2022', ' IJBES 2022', 'ACIJ 2022', 'CIoT 2022', 'IJDKP 2022', 'JARES 2022', 'IJMNCT 2022', 'SAIM 2022', 'CAIML 2022', 'IJITCS 2022', 'AMA 2022', 'AVC 2022', 'MLAIJ 2022', 'IJMA 2022', 'NMCO 2022', 'CSEIJ 2022']
['MATE 2022', 'IJAIA 2022', 'NATAP 2022', 'ARIN 2022', 'IJANS 2022', 'CiVEJ 2022', 'FCST  2022', 'ESIJ 2022', 'GRAPH-HOC 2022', 'ICAIT  2022', 'NATAP 2022', 'IJCSES  2022', 'NLPCL 2022', 'SAIM 2022', 'ACII 2022', 'ECTIJ 2022', 
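
The loop above depends on each conference occupying two consecutive table rows. A small self-contained sketch of that even/odd pairing follows. The HTML snippet is hypothetical markup written for this example, not copied from the site, and `html.parser` is used instead of `lxml` only to avoid an extra dependency:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header row, then two rows per conference.
html = """
<table>
<tr><td>Event</td><td>When</td><td>Where</td><td>Deadline</td></tr>
<tr><td rowspan="2"><a href="#">AAA 2022</a></td><td colspan="3">First Example Conference</td></tr>
<tr><td>Jan 1, 2022</td><td>Paris, France</td><td>Dec 1, 2021</td></tr>
<tr><td rowspan="2"><a href="#">BBB 2022</a></td><td colspan="3">Second Example Conference</td></tr>
<tr><td>Feb 1, 2022</td><td>Rome, Italy</td><td>Jan 1, 2022</td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')

# Link texts give the acronyms.
acronyms = [a.text for a in table.find_all('a')]
# Drop the header row, then collect the cells of every remaining row.
rows = [tr.find_all('td') for tr in table.find_all('tr')[1:]]
# Even-positioned rows hold the name, odd-positioned rows the location.
names = [cells[1].text for cells in rows[0::2]]
locations = [cells[1].text for cells in rows[1::2]]

print(list(zip(acronyms, names, locations)))
```

If the three lists ever differ in length (for example when a filtered row still contributes a link), `pd.DataFrame` raises a `ValueError`, so comparing the lengths before building the frame is a cheap sanity check.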

In [74]:
final_df_ai = pd.concat(df_list)
final_df_ai

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location
0,ICSI 2022,International Conference on Swarm Intelligence,"Xi'an, China"
1,"ISAI--Ei Compendex, Scopus 2022",2022 the 2nd International Symposium on AI (IS...,"Chengdu, China"
2,IEEE--ICAIBD--Ei and Scopus 2022,IEEE--2022 The 5th International Conference on...,"Chengdu, China"
3,"WSEA--Ei Compendex, Scopus 2022",2022 The 2nd International Workshop on Softwar...,"Chengdu, China"
4,"ICIST--Ei Compendex, Scopus 2022",2022 The 4th International Conference on Intel...,"Harbin, China"
...,...,...,...
15,ISCC 2022,The 27th IEEE Symposium on Computers and Commu...,"Rhodes Island, Greece"
16,MIPRO 2022,45th Jubilee International Convention on Infor...,"Opatija, Croatia"
17,ICPRS 2022,12th International Conference on Pattern Recog...,"St Etienne, France"
18,MLHMI--Ei and Scopus 2022,2022 3rd International Conference on Machine L...,Singapore
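
The index in the output above restarts on every page because `pd.concat` keeps each frame's own row labels by default. A short sketch with two made-up single-column frames shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Two toy "pages"; the acronyms are invented for the example.
page1 = pd.DataFrame({'Conference_Acronym': ['AAA 2022', 'BBB 2022']})
page2 = pd.DataFrame({'Conference_Acronym': ['CCC 2022', 'DDD 2022']})

# Default behaviour: each frame's labels are carried over unchanged.
stacked = pd.concat([page1, page2])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True assigns fresh labels 0..n-1 to the combined frame.
renumbered = pd.concat([page1, page2], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```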


In [75]:
final_df_without_index_ai = final_df_ai.reset_index(drop=True)

In [77]:
ai_df = final_df_without_index_ai
ai_df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location
0,ICSI 2022,International Conference on Swarm Intelligence,"Xi'an, China"
1,"ISAI--Ei Compendex, Scopus 2022",2022 the 2nd International Symposium on AI (IS...,"Chengdu, China"
2,IEEE--ICAIBD--Ei and Scopus 2022,IEEE--2022 The 5th International Conference on...,"Chengdu, China"
3,"WSEA--Ei Compendex, Scopus 2022",2022 The 2nd International Workshop on Softwar...,"Chengdu, China"
4,"ICIST--Ei Compendex, Scopus 2022",2022 The 4th International Conference on Intel...,"Harbin, China"
...,...,...,...
395,ISCC 2022,The 27th IEEE Symposium on Computers and Commu...,"Rhodes Island, Greece"
396,MIPRO 2022,45th Jubilee International Convention on Infor...,"Opatija, Croatia"
397,ICPRS 2022,12th International Conference on Pattern Recog...,"St Etienne, France"
398,MLHMI--Ei and Scopus 2022,2022 3rd International Conference on Machine L...,Singapore


In [78]:
ai_df['Conference_Acronym'] = ai_df['Conference_Acronym'].str.strip()
ai_df['Conference_year'] = ai_df['Conference_Acronym'].str.split().str[-1]

In [79]:
ai_df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location,Conference_year
0,ICSI 2022,International Conference on Swarm Intelligence,"Xi'an, China",2022
1,"ISAI--Ei Compendex, Scopus 2022",2022 the 2nd International Symposium on AI (IS...,"Chengdu, China",2022
2,IEEE--ICAIBD--Ei and Scopus 2022,IEEE--2022 The 5th International Conference on...,"Chengdu, China",2022
3,"WSEA--Ei Compendex, Scopus 2022",2022 The 2nd International Workshop on Softwar...,"Chengdu, China",2022
4,"ICIST--Ei Compendex, Scopus 2022",2022 The 4th International Conference on Intel...,"Harbin, China",2022
...,...,...,...,...
395,ISCC 2022,The 27th IEEE Symposium on Computers and Commu...,"Rhodes Island, Greece",2022
396,MIPRO 2022,45th Jubilee International Convention on Infor...,"Opatija, Croatia",2022
397,ICPRS 2022,12th International Conference on Pattern Recog...,"St Etienne, France",2022
398,MLHMI--Ei and Scopus 2022,2022 3rd International Conference on Machine L...,Singapore,2022
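
The `Conference_year` column shown above is the last whitespace-separated token of the stripped acronym. A quick sketch on three acronyms taken from the printed lists earlier illustrates why `str.strip()` and the argument-free `str.split()` cope with leading spaces and doubled spaces:

```python
import pandas as pd

# Acronyms as they appear in the scraped lists, including stray whitespace.
acronyms = pd.Series([' IJBES 2022', 'FCST  2022', 'ISAI--Ei Compendex, Scopus 2022'])

# strip() removes leading/trailing whitespace only.
cleaned = acronyms.str.strip()
# split() with no argument splits on any run of whitespace.
years = cleaned.str.split().str[-1]

print(cleaned.tolist())
print(years.tolist())  # ['2022', '2022', '2022']
```

Note that the extracted values are strings, not integers.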


Remove duplicate rows from the AI conference table.

In [80]:
ai_df.drop_duplicates(inplace=True)

In [81]:
len(ai_df)

394
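
`drop_duplicates` removes rows but does not renumber the ones that remain, which is why the frame can have 394 rows while its last label is still 398. A minimal sketch with invented data shows the effect, and why `.at` must be given a label rather than a position:

```python
import pandas as pd

# Toy data with one exact duplicate row (labels 0 and 2).
demo_df = pd.DataFrame({
    'Conference_Acronym': ['AAA 2022', 'BBB 2022', 'AAA 2022', 'CCC 2022'],
    'Conference_Location': ['Paris', 'Rome', 'Paris', 'Oslo'],
})
deduped = demo_df.drop_duplicates()

print(len(deduped))            # 3
print(deduped.index.tolist())  # [0, 1, 3]

# .at addresses rows by label: label 3 is the third remaining row.
last_acronym = deduped.at[3, 'Conference_Acronym']
print(last_acronym)            # CCC 2022
```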

In [82]:
ai_df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location,Conference_year
0,ICSI 2022,International Conference on Swarm Intelligence,"Xi'an, China",2022
1,"ISAI--Ei Compendex, Scopus 2022",2022 the 2nd International Symposium on AI (IS...,"Chengdu, China",2022
2,IEEE--ICAIBD--Ei and Scopus 2022,IEEE--2022 The 5th International Conference on...,"Chengdu, China",2022
3,"WSEA--Ei Compendex, Scopus 2022",2022 The 2nd International Workshop on Softwar...,"Chengdu, China",2022
4,"ICIST--Ei Compendex, Scopus 2022",2022 The 4th International Conference on Intel...,"Harbin, China",2022
...,...,...,...,...
395,ISCC 2022,The 27th IEEE Symposium on Computers and Commu...,"Rhodes Island, Greece",2022
396,MIPRO 2022,45th Jubilee International Convention on Infor...,"Opatija, Croatia",2022
397,ICPRS 2022,12th International Conference on Pattern Recog...,"St Etienne, France",2022
398,MLHMI--Ei and Scopus 2022,2022 3rd International Conference on Machine L...,Singapore,2022


In [83]:
ai_df.at[180,'Conference_year']=2022
ai_df.at[382,'Conference_year']=2022
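
Writing the fix as the integer `2022` leaves the column with a mix of strings (from `str.split`) and integers. The CSV output looks the same either way, but if the column is sorted or filtered later it may help to normalise it. A small sketch with toy data follows. The `object` dtype is set explicitly so the example behaves the same across pandas versions:

```python
import pandas as pd

# Toy column: two extracted year strings and one token that is not a year.
demo_df = pd.DataFrame({
    'Conference_year': pd.Series(['2022', 'Scopus', '2021'], dtype=object),
})
# Manual fix written as an int, mirroring the cell above.
demo_df.at[1, 'Conference_year'] = 2022

# The column now holds both str and int values.
print(demo_df['Conference_year'].map(type).tolist())

# Convert everything to a single numeric dtype.
demo_df['Conference_year'] = pd.to_numeric(demo_df['Conference_year'])
print(demo_df['Conference_year'].tolist())  # [2022, 2022, 2021]
```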

In [84]:
ai_df

Unnamed: 0,Conference_Acronym,Conference_Name,Conference_Location,Conference_year
0,ICSI 2022,International Conference on Swarm Intelligence,"Xi'an, China",2022
1,"ISAI--Ei Compendex, Scopus 2022",2022 the 2nd International Symposium on AI (IS...,"Chengdu, China",2022
2,IEEE--ICAIBD--Ei and Scopus 2022,IEEE--2022 The 5th International Conference on...,"Chengdu, China",2022
3,"WSEA--Ei Compendex, Scopus 2022",2022 The 2nd International Workshop on Softwar...,"Chengdu, China",2022
4,"ICIST--Ei Compendex, Scopus 2022",2022 The 4th International Conference on Intel...,"Harbin, China",2022
...,...,...,...,...
395,ISCC 2022,The 27th IEEE Symposium on Computers and Commu...,"Rhodes Island, Greece",2022
396,MIPRO 2022,45th Jubilee International Convention on Infor...,"Opatija, Croatia",2022
397,ICPRS 2022,12th International Conference on Pattern Recog...,"St Etienne, France",2022
398,MLHMI--Ei and Scopus 2022,2022 3rd International Conference on Machine L...,Singapore,2022


In [85]:
ai_df.to_csv('ai_db.csv',index=False)
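
With `index=False` the row labels are left out of the file, so any gaps left by `drop_duplicates` do not reach the CSV. A sketch of the round trip, using an in-memory buffer instead of a file and a one-row frame invented for the example:

```python
import io
import pandas as pd

# One toy row with a leftover label of 7, as might remain after de-duplication.
demo_df = pd.DataFrame(
    {'Conference_Acronym': ['AAA 2022'], 'Conference_year': ['2022']},
    index=[7],
)

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
print(buffer.getvalue())

# Reading the data back yields only the two data columns and a fresh index.
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())  # ['Conference_Acronym', 'Conference_year']
print(reloaded.index.tolist())    # [0]
```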

In [86]:
files.download('ai_db.csv')


In [97]:
from google.colab import files
files.download('bigdata_db.csv')
