# Structure and Text in Wikipedia

In this assignment, you will use regular expressions to process the wiki source of Wikipedia articles with the purpose of:

1. Extracting useful information from structures such as piped links and category links.
2. Extracting the table of contents of the article.
3. Extracting a clean version of the text that can be used for NLP.
    - This is done by removing references, infoboxes, pictures, and categories.
    - Piped links *[[string1|string2]]* would need to be replaced with the surface string *string2*.

## Write Your Name Here: Shreyas Yermal Lokesha
###### UNCC Id: 801210964
###### e-mail id: slokesha@uncc.edu

# <font color="blue"> Submission Instructions</font>

1. Click the Save button at the top of the Jupyter Notebook.
2. Please make sure to have entered your name above.
3. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of all cells).
4. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.
5. Once you've rerun everything, select File -> Download as -> PDF via LaTeX and download a PDF version *wikipedia.pdf* showing the code and the output of all cells, and save it in the same folder that contains the notebook file *wikipedia.ipynb*.
6. Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only thing we will see when grading!
7. Submit **both** your PDF and notebook on Canvas.

Read the source of the Wikipedia article. For debugging purposes, you may consider using a shorter article first.

In [1]:
# You can use this shorter article first. 
# source = open('../data/FM-2030.txt', 'r').read()

source = open('../data/University_of_North_Carolina_at_Charlotte.txt', 'r', encoding = "utf-8").read()

**Task 1.a** Design a regular expression *piped* that matches piped strings of the type *[[string1|string2]]*. Use parentheses to group string1 and string2, such that *piped.findall(source)* returns them in a tuple *(string1, string2)*. For example, when run on the source of the UNCC article, the code below should result in a list that starts as *[('Public university', 'Public'), ('University of North Carolina', 'UNC System'), ...]* and that contains 44 elements.

In [2]:
import re

#source = open('../data/FM-2030.txt', 'r', encoding = 'utf-8').read()
#piped_re = re.compile(r'(\[+[\s,/\.\-\(\)\w\s]+)\|([\s,/\.\-\(\)\w\s]+]\]+)')
piped_re = re.compile(r'\[\[([^\[\]\|]*?)\|([^\[\]\|]*?)\]\]')
mp = piped_re.findall(source)
print('Found', len(mp), 'piped links.')
print(mp)

Found 44 piped links.
[('Public university', 'Public'), ('University of North Carolina', 'UNC System'), ('Charlotte, North Carolina', 'Charlotte'), ('University City (Charlotte neighborhood)', 'University City'), ('Conference USA', 'C-USA'), ('Charlotte 49ers', '49ers'), ('Association of Public and Land-Grant Universities', 'APLU'), ('Coalition of Urban and Metropolitan Universities', 'CUMU'), ('Oak Ridge Associated Universities', 'ORAU'), ('public education', 'public'), ('Carnegie Classification of Institutions of Higher Education', 'classified'), ('University City, North Carolina', 'University City'), ('Raleigh, North Carolina', 'Raleigh'), ('University of North Carolina School of Medicine', 'existing two-year school'), ('University of North Carolina at Chapel Hill', 'UNC-Chapel Hill'), ('post–World War II baby boom', 'post–World War II'), ('Cabarrus County, North Carolina', 'Cabarrus County'), ('James H Woodward', 'James H. Woodward'), ('James H Woodward', 'James H. Woodward'), ('Pr
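To see what the pattern accepts and rejects, here is a minimal sketch that runs the same expression on a short, made-up wiki snippet (the `sample` string is hypothetical and not taken from the article). Because both character classes exclude `[`, `]` and `|`, plain links and links with more than one pipe are skipped, while a category link with a sort key still counts as a piped link.

```python
import re

# Same pattern as in the cell above.
piped_re = re.compile(r'\[\[([^\[\]\|]*?)\|([^\[\]\|]*?)\]\]')

# Hypothetical snippet, used only for illustration.
sample = ("A [[Public university|Public]] school in [[North Carolina]] "
          "with [[File:Seal.png|thumb|The seal]] and [[Category:Schools| ]].")

links = piped_re.findall(sample)
print(links)
# [[North Carolina]] has no pipe and the File link has two pipes, so neither
# matches; the Category link has exactly one pipe, so it does.
```

This may explain why the count here can be higher than the number of links actually replaced later, when category links are skipped.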

**Task 1.b** Design a regular expression *categ* that matches category strings of the type *[[Category:name]]*. Use parentheses to group the name part. When run on the source of the UNCC article, the code below should result in a list that starts as *['University of North Carolina at Charlotte| ', 'Educational institutions established in 1946', ...]* and that contains 6 elements.

In [3]:
categ_re = re.compile(r"\[\[Category:(.*)\]\]")
mc = categ_re.findall(source)
print('Found', len(mc), 'categories.')
print(mc)

Found 6 categories.
['University of North Carolina at Charlotte| ', 'Educational institutions established in 1946', 'Universities and colleges in Charlotte, North Carolina', 'Public universities and colleges in North Carolina|University of North Carolina at Charlotte', 'Universities and colleges accredited by the Southern Association of Colleges and Schools', '1946 establishments in North Carolina']
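Some of the names above carry extra text after a `|`, which in wiki markup is the category sort key. If the name and the sort key are needed separately, one possible variant is sketched below. The two-group pattern and the `sample` string are illustrative assumptions, not part of the assignment's expected answer.

```python
import re

# Group 1: category name; group 2: optional sort key after the pipe.
categ_split_re = re.compile(r'\[\[Category:([^\]\|]*)(?:\|([^\]]*))?\]\]')

# Hypothetical two-line sample modeled on the output above.
sample = ("[[Category:University of North Carolina at Charlotte| ]]\n"
          "[[Category:Educational institutions established in 1946]]\n")

pairs = categ_split_re.findall(sample)
for name, sort_key in pairs:
    # findall returns '' for the sort key when the optional group is absent.
    print(repr(name), repr(sort_key))
```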


**Task 2** Extract the table of contents of the article, i.e. a list of all the section titles in the article. When run on the UNCC article, it should find 33 section titles.

In [4]:
title_re = re.compile(r'==+(.+?)==+')
#==+[\s\w\s,\.\']+==+|==+(?=\w+-\w+)==+
mt = title_re.findall(source)
print('found', len(mt), 'titles.')
for title in mt:
    print(title) 

found 33 titles.
History
Leaders of the university
Bonnie Ethel Cone, founder
Chancellors
 2019 shooting 
Campuses
Main Campus – University City
Charlotte Research Institute Campus
Center City Campus
Students
Academics
Colleges and programs
Scholarships
Library system
Athletics
Men's basketball
Women's basketball
Baseball
Football
Golf
Men's soccer
Track and field
Volleyball
Student organizations
University name
Transportation on campus
Niner Transit
Light rail
CATS buses
Notable alumni and faculty
 See also 
References
External links
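Some titles in the list above keep their surrounding spaces, and the nesting level of each section is lost. A hedged sketch of one way to keep both: capture the run of `=` signs, require the same run on the closing side with a backreference, and trim spaces around the title. The `heading_re` name and the `sample` text are assumptions made for this example.

```python
import re

# \1 forces the closing run of '=' to match the opening run exactly.
heading_re = re.compile(r'^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$', re.MULTILINE)

# Hypothetical source fragment, not copied from the article.
sample = "==History==\ntext\n=== 2019 shooting ===\nmore\n==Campuses==\n"

# Level 2 corresponds to '==', level 3 to '===', and so on.
toc = [(len(m.group(1)), m.group(2)) for m in heading_re.finditer(sample)]
for level, title in toc:
    print('  ' * (level - 2) + title)
```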


**Task 3.a** Design a regular expression *ref_re* that matches reference strings enclosed between reference tags `<ref ...> ... </ref>` so that they can be eliminated from the document. Beware also of the alternative form `<ref .../>`.

In [5]:
ref_re = re.compile(r'<ref.+?>(.+?)</ref>|<ref.+?/>|<ref>.+?</ref>|<ref>.+?')
mref = ref_re.findall(source)
print('found', len(mref), 'references.')
for ref in mref:
    print(ref)

found 45 references.
{{cite web|url=http://publicrelations.uncc.edu/information-media-kit/university-history|title=University History - Office of News and Information - UNC Charlotte|website=publicrelations.uncc.edu|access-date=May 11, 2017}}


{{cite web|url=https://admissions.uncc.edu/about-unc-charlotte/university-profile|title=UNIVERSITY PROFILE|website=admissions.uncc.edu|access-date=October 11, 2020}}



{{cite web|url=http://publicrelations.uncc.edu/sites/publicrelations.uncc.edu/files/media/factsheet_march%202012.pdf|title=Faculty|access-date=May 11, 2017|url-status=dead|archive-url=https://web.archive.org/web/20150914165430/http://publicrelations.uncc.edu/sites/publicrelations.uncc.edu/files/media/factsheet_march%202012.pdf|archive-date=September 14, 2015|df=mdy-all}}
) is a [[public education|public]] [[research university]] in [[Charlotte, North Carolina]]. UNC Charlotte offers 23 doctoral, 64 master's, and 140 bachelor's degree programs through nine colleges: the College of
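The last match shown above contains article text, which suggests that `<ref.+?>` can run past a self-closing `<ref ... />` tag and continue to the next `</ref>`. One possible tightening is sketched below: restrict the tag attributes to `[^>]*` and try the self-closing form first. This is an illustrative alternative under those assumptions, checked only on the made-up `sample` string, and its match count on the real article may differ from the one printed above.

```python
import re

# Self-closing form first, then the paired form; [^>]* cannot leave the tag.
ref_strict_re = re.compile(r'<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>',
                           re.DOTALL | re.IGNORECASE)

# Hypothetical sample mixing the three tag shapes.
sample = ('UNCC<ref name="a" /> is a public university.'
          '<ref>{{cite web|url=x}}</ref> '
          'It opened in 1946.<ref name="b">{{cite book|title=y}}</ref>')

clean = ref_strict_re.sub('', sample)
print(clean)
```

The `\b` after `ref` keeps the pattern from matching a `<references />` tag.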

Remove all references from the source string.

In [6]:
source = ref_re.sub("",source)
print(source)

{{Use mdy dates|date=October 2011}}
{{Infobox university
| name = The University of North Carolina at Charlotte
| native_name = 
| latin_name = 
| image = UNC Charlotte seal.png
| image_upright = .7
| motto = 
| established = {{start date and age|1946}}
| type = [[Public university|Public]]
| parent = [[University of North Carolina|UNC System]]
| endowment = $230.35 million (2019)
| staff = 
| faculty = 1,456
| president = 
| provost = Joan Lorden
| principal = 
| rector = 
| chancellor = Sharon Gaber
| vice_chancellor = Kevin Bailey
| dean = Christine Reed Davis
| head_label = 
| head = 
| students = 30,146 (Fall 2020)
| undergrad = 24,175 (Fall 2020)
| postgrad = 5,971 (Fall 2020)
| city = [[Charlotte, North Carolina|Charlotte]]
| state = [[North Carolina]]
| country = United States
| campus = [[University City (Charlotte neighborhood)|University City]]<br />{{convert|1000|acre|km2|1|abbr=on}}
| former_names = Charlotte Center of the University of North Carolina (1946–1949)<br />Char

**Task 3.b** Replace all piped links [[string1|string2]] and [[string2]] with the surface string string2.

In [7]:
pip_re = re.compile(r'\[\[[^\[\]\|]*?\|([^\[\]\|]*?)\]\]')
pip_re2 = re.compile(r'\[\[([^\[\]\|]*?)\]\]')
pre_StringExtract = re.compile(r'\[\[[^\[\]\|]*?\|[^\[\]\|]*?\]\]')
pre2_StringExtract = re.compile(r'\[\[[^\[\]\|]*?\]\]')
mpip = pip_re.findall(source)
mpip2 = pip_re2.findall(source)
pre_StExtArr = pre_StringExtract.findall(source)
pre2_StExtArr = pre2_StringExtract.findall(source)
text = ''
print('found', len(mpip), 'piped links.')
for pip in mpip:
    print(pip)

print('found', len(mpip2), 'piped links2.')
for pip in mpip2:
    print(pip)

x = 'Category:'
for i in range(len(mpip)):
    if(x not in mpip[i] and x not in pre_StExtArr[i]):
        source = source.replace(pre_StExtArr[i],mpip[i])

for i in range(len(mpip2)):
    if(x not in mpip2[i] and x not in pre2_StExtArr[i]):
        source = source.replace(pre2_StExtArr[i],mpip2[i])
    

        
#source = pip_re.sub(mpip, source)
print(source)

found 40 piped links.
Public
UNC System
Charlotte
University City
C-USA
49ers
APLU
CUMU
ORAU
University City
Raleigh
existing two-year school
UNC-Chapel Hill
James H. Woodward
James H. Woodward
Provost
N.C. Highway 49
University City
man-made lakes
first ward of Uptown Charlotte
LYNX Blue Line Extension
classified
doctoral
fine
NCAA Tournament
NBA
N.C. State
NCAA Tournament
Jason Stanford
Buffalo Bulls
Bahamas Bowl
Nate Davis
Cameron Clark
College Cup
MLS
Lee Rose
Jeff Mullins
Bobby Lutz
 
University of North Carolina at Charlotte
found 97 piped links2.
North Carolina
NCAA Division I
Norm the Niner
University of North Carolina at Chapel Hill
North Carolina State University
North Carolina
North Carolina Community College System
University of North Carolina
Bonnie Ethel Cone
Dean W. Colvard
E.K. Fretwell
Philip L. Dubois
Sharon Gaber
Bonnie Ethel Cone
Dean W. Colvard
Mississippi State University
Loyola University Chicago
Emeritus
E.K. Fretwell
University of Massachusetts
University of No
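The same replacement can also be written as a single `re.sub` call with a backreference, which avoids keeping parallel lists of matches. The sketch below is one way to do it, assuming that category links should be left alone as in the loop above; `link_re` and `sample` are names invented for this example.

```python
import re

# Optional "target|" prefix, then the surface string captured in group 1.
# The negative lookahead leaves [[Category:...]] links untouched.
link_re = re.compile(r'\[\[(?!Category:)(?:[^\[\]\|]*\|)?([^\[\]\|]*)\]\]')

# Hypothetical snippet, used only for illustration.
sample = ("a [[Public university|Public]] school in [[North Carolina]] "
          "[[Category:Schools| ]] [[File:Seal.png|thumb|The seal]]")

text = link_re.sub(r'\1', sample)
print(text)
# The File link has two pipes, so the pattern does not match it either.
```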

**Task 3.c** Design a regular expression file_re that matches file strings of the type *[[File: ...]]*. Use the regular expression to remove all file strings from the source.

In [8]:
file_re = re.compile(r'\[\[File:.*\]\]+')
mfile = file_re.findall(source)
print('found', len(mfile), 'file links.')
for file in mfile:
    print(file)

source = file_re.sub('', source)
print(source)

found 6 file links.
[[File:Ucity.jpg|thumb|250px|Aerial view of UNC Charlotte]]
[[File:Conegrave.jpg|thumb|275px|right|Bonnie Cone's final resting place on the campus of UNC Charlotte, with Cato Hall and Fretwell Hall in the background. Also thought to be the meeting place of Diu Memoriae Consilium.]]
[[File:UNCCNewQuad.jpg|302x302px|thumb|right|This quad-style area was completed in 2007 with the completion of the College of Health and Human Services (left) and the Cato College of Education (right).]]
[[File:UNC Charlotte Center City Campus.jpg|alt=UNC Charlotte's Center City Campus|left|thumb|UNC Charlotte's Center City Campus is located on 9th Street in Uptown Charlotte. The building is home to a number of graduate-level programs in order to meet the needs of working professionals in the second largest financial city in America.]]
[[File:Belktower.jpg|233x233px|thumb|The Carillon and J. Murrey Atkins Library entrance on UNC Charlotte's main campus (left) and the Belk Tower (middle), 

**Task 3.d** Use a regular expression to remove all category links from the source.

In [9]:
categ_re = re.compile(r'\[\[Category:.*\]\]+')
mc = categ_re.findall(source)
print('Found', len(mc), 'categories.')
print(mc)

source = categ_re.sub('', source)
print(source)

Found 6 categories.
['[[Category:University of North Carolina at Charlotte| ]]', '[[Category:Educational institutions established in 1946]]', '[[Category:Universities and colleges in Charlotte, North Carolina]]', '[[Category:Public universities and colleges in North Carolina|University of North Carolina at Charlotte]]', '[[Category:Universities and colleges accredited by the Southern Association of Colleges and Schools]]', '[[Category:1946 establishments in North Carolina]]']
{{Use mdy dates|date=October 2011}}
{{Infobox university
| name = The University of North Carolina at Charlotte
| native_name = 
| latin_name = 
| image = UNC Charlotte seal.png
| image_upright = .7
| motto = 
| established = {{start date and age|1946}}
| type = Public
| parent = UNC System
| endowment = $230.35 million (2019)
| staff = 
| faculty = 1,456
| president = 
| provost = Joan Lorden
| principal = 
| rector = 
| chancellor = Sharon Gaber
| vice_chancellor = Kevin Bailey
| dean = Christine Reed Davis
| he

**Task 3.e** *Mandatory for graduate students, optional (bonus points) for undergraduate students*

- Remove all templates and infoboxes from the source document.
    - These are any strings of the type '{{ ... }}'
    - Beware that there can be multiple levels of nesting, e.g. '{{ ... {{ .. {{ .... }} .. }} ... }}'. This cannot be matched with regular expressions (explain why).

In [10]:
def remove_templates(s):
    # YOUR CODE GOES HERE
    z = 0
    refstr = ''
    for line in source:
        z += 1
        x = 0
        y = 0
    
    #print('In line ',z)
        for i in range(len(line)-1):
            if(line[i]=='{' and line[i+1]=='{'):
                x += 1
            elif(line[i]=='}' and line[i+1]=='}'):
                y += 1
        if(x == y):
        #print("Matched flower brackets pair")
            templRegex = re.compile(r'\{\{.*\}\}')
            arrayFind = templRegex.findall(line)
        #print('Value found',arrayFind)
            refstr = refstr + templRegex.sub('',line)
        
    newstr = ''
    infoPattern = re.compile(r'\|.*\=.*')
    infoboxArr = infoPattern.findall(refstr)
    newstr = newstr + infoPattern.sub('',refstr)
    #print(newstr)
    #finalstr = ''
    #patMatch = re.compile(r'\{\{(.*)|(.*)\}\}')
    #finalstr = finalstr + patMatch.sub('',newstr)
    #return finalstr
    return newstr

#source = remove_templates(source)
source = remove_templates(source)
print(source)

{{Use mdy dates
{{Infobox university








































}}



'''The University of North Carolina at Charlotte''' ('''UNC Charlotte''', '''UNCC''', or simply '''Charlotte'''

UNC Charlotte is the largest institution of higher education in the Charlotte region. The university has experienced rapid enrollment growth of 33% over the past 10 years, making it the fastest-growing institution in the UNC System and contributing to more than 50% of the system's growth since 2009. In 2020, it surpassed the University of North Carolina at Chapel Hill to become the second-largest school in the UNC system by student enrollment.

It has three campuses: Charlotte Research Institute Campus, Center City Campus, and the main campus, located in University City. The main campus sits on 1,000 wooded acres with approximately 85 buildings about {{convert|8|mi|km}} from Uptown Charlotte.

==History==

The city of Charlotte had sought a public university since 1871 but was never able to su
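The output above still contains the opening `{{Infobox university` and an inline `{{convert|8|mi|km}}`, so an approach that tracks nesting depth directly may be worth comparing. The following is a minimal sketch of such a scanner, not the assignment's reference solution; the name `strip_templates` is hypothetical and chosen so that it does not replace `remove_templates`.

```python
def strip_templates(text):
    """Drop every {{ ... }} span, including nested ones, by tracking depth."""
    out = []
    depth = 0  # number of '{{' currently open
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == '{{':
            depth += 1
            i += 2
        elif pair == '}}' and depth > 0:
            depth -= 1
            i += 2
        else:
            # Keep a character only when we are outside all templates.
            if depth == 0:
                out.append(text[i])
            i += 1
    return ''.join(out)

# Hypothetical sample with one nested template and one inline template.
sample = ("{{Infobox\n| established = {{start date|1946}}\n}}\n"
          "About {{convert|8|mi|km}} from Uptown.")
stripped = strip_templates(sample)
print(stripped)
```

Removing an inline template leaves the surrounding spaces in place, so a later whitespace clean-up pass may still be needed.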

**Task 4 [Bonus points]** Anything extra goes here.

In the last question, the whole string {{..{{...{{..}}..}}..}} cannot be matched by a single regular expression. Regular expressions describe regular languages, which are exactly the languages recognized by finite-state machines. A finite-state machine has no memory beyond its current state, so it cannot keep track of how many '{{' are still open when the nesting depth is unbounded. Balanced, arbitrarily nested delimiters form a context-free language: in the study of Formal Languages and Automata Theory, such a language is generated by a context-free grammar (CFG) and recognized by a PDA (Push Down Automaton), which is a finite-state machine extended with a stack. The stack is what tracks the nesting depth, and Python's re module offers no equivalent, irrespective of whether a greedy expression like .+ or a non-greedy expression like .+? is used.

For instance, if we consider a string of the form {{..{{...{{..}}..}}..}}, a non-greedy pattern would match x{{..{{..{{..}}x..}}..}}, where x marks the start and the end of the match: it stops at the first '}}' and leaves the remaining closing braces behind. A greedy pattern instead runs to the last '}}' it can reach, which also swallows any ordinary text between two separate templates.

In order to match the entire structure in one shot, one must write code for a parser, or at least a scanner that counts the nesting depth.
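A small experiment illustrates both failure modes. This is a sketch on a made-up string (`nested` is hypothetical), using the two obvious single-pattern attempts.

```python
import re

# Hypothetical text: one nested template followed by a separate one.
nested = "A {{outer {{inner}} tail}} B {{second}} C"

# Non-greedy: stops at the first '}}', cutting the outer template short.
lazy = re.findall(r'\{\{.+?\}\}', nested)

# Greedy: runs to the last '}}', swallowing the plain text ' B '.
greedy = re.findall(r'\{\{.+\}\}', nested)

print(lazy)
print(greedy)
```

Neither result isolates `{{outer {{inner}} tail}}` as one unit, which is the behaviour the explanation above predicts.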