# Structure and Text in Wikipedia

In this assignment, you will use regular expressions to process the source wiki of Wikipedia articles with the purpose of:

1. Extracting useful information from structures such as piped links and category links.
2. Extract the table of contents of the article.
2. Extract a clean version of the text that can be used for NLP.
    - This is done by removing references, infoboxes, pictures, and categories.
    - Piped links *[[string1|string2]]* would need to be replaced with the surface string *string2*.

## Write Your Name Here: Naimisha Churi

# <font color="blue"> Submission Instructions</font>

1. Click the Save button at the top of the Jupyter Notebook.
2. Please make sure to have entered your name above.
3. Select Cell -> All Output -> Clear. This will clear all the outputs from all cells (but will keep the content of ll cells). 
4. Select Cell -> Run All. This will run all the cells in order, and will take several minutes.
5. Once you've rerun everything, select File -> Download as -> PDF via LaTeX and download a PDF version *wikipedia.pdf* showing the code and the output of all cells, and save it in the same folder that contains the notebook file *wikipedia.ipynb*.
6. Look at the PDF file and make sure all your solutions are there, displayed correctly. The PDF is the only thing we will see when grading!
7. Submit **both** your PDF and notebook on Canvas.
8. Make sure your your Canvas submission contains the correct files by downloading it after posting it on Canvas.

Read the source of the Wikipedia article. For debugging purposes, you may consider using a shorter article first.

In [160]:
# You can use this shorter article first. 
# fname = '../data/FM-2030.txt'
fname = '../data/University_of_North_Carolina_at_Charlotte.txt'
source = open(fname, 'r', encoding = 'utf-8').read()

**Task 1.a** Design a regular expression *piped* that matches piped strings of the type *[[string1|string2]]*. Use parantheses to group string1 and string2, such that *piped.findall(source)* returns them in a tuple *(string1, string2)*. For example, when run on the source of the UNCC article, the code below should result in a list that starts as *[('Public university', 'Public'), ('University of North Carolina', 'UNC System'), ...]* and that contains 44 elements.

In [161]:
import re

piped_re = re.compile(r'\[\[([^\[\]\|]*?)\|([^\[\]\|]*?)\]\]')
mp = piped_re.findall(source)
print('Found', len(mp), 'piped links.')
print(mp)

Found 44 piped links.
[('Public university', 'Public'), ('University of North Carolina', 'UNC System'), ('Charlotte, North Carolina', 'Charlotte'), ('University City (Charlotte neighborhood)', 'University City'), ('Conference USA', 'C-USA'), ('Charlotte 49ers', '49ers'), ('Association of Public and Land-Grant Universities', 'APLU'), ('Coalition of Urban and Metropolitan Universities', 'CUMU'), ('Oak Ridge Associated Universities', 'ORAU'), ('public education', 'public'), ('Carnegie Classification of Institutions of Higher Education', 'classified'), ('University City, North Carolina', 'University City'), ('Raleigh, North Carolina', 'Raleigh'), ('University of North Carolina School of Medicine', 'existing two-year school'), ('University of North Carolina at Chapel Hill', 'UNC-Chapel Hill'), ('post–World War II baby boom', 'post–World War II'), ('Cabarrus County, North Carolina', 'Cabarrus County'), ('James H Woodward', 'James H. Woodward'), ('James H Woodward', 'James H. Woodward'), ('Pr

**Task 1.b** Design a regular expression *categ* that matches category strings of the type *[[Category:name]]*. Use parantheses to group the name part. When run on the source of the UNCC article, the code below should result in a list that starts as *['University of North Carolina at Charlotte| ', 'Educational institutions established in 1946', ...]* and that contains 6 elements.

In [162]:
categ_re = re.compile(r'\[\[Category:(.*)\]\]')
mc = categ_re.findall(source)
print('Found', len(mc), 'categories.')
print(mc)

Found 6 categories.
['University of North Carolina at Charlotte| ', 'Educational institutions established in 1946', 'Universities and colleges in Charlotte, North Carolina', 'Public universities and colleges in North Carolina|University of North Carolina at Charlotte', 'Universities and colleges accredited by the Southern Association of Colleges and Schools', '1946 establishments in North Carolina']


**Task 2** Extract the table of contents of the article, i.e. a list of all the section titles in the article. When run on the UNCC article, it should find 33 section titles.

In [163]:
title_re = re.compile(r'==+(.+[^=])==')
mt = title_re.findall(source)
print('found', len(mt), 'titles.')
for title in mt:
    print(title)

found 33 titles.
History
Leaders of the university
Bonnie Ethel Cone, founder
Chancellors
 2019 shooting 
Campuses
Main Campus – University City
Charlotte Research Institute Campus
Center City Campus
Students
Academics
Colleges and programs
Scholarships
Library system
Athletics
Men's basketball
Women's basketball
Baseball
Football
Golf
Men's soccer
Track and field
Volleyball
Student organizations
University name
Transportation on campus
Niner Transit
Light rail
CATS buses
Notable alumni and faculty
 See also 
References
External links


**Task 3.a** Design a regular expression *ref_re* that matches reference strings enclosed between reference tags "<ref ...> ... <\/ref>" so that they can be eliminated from the document. Beware also of the alternative form "<ref ...\/>".

In [164]:
ref_re = re.compile(r'<ref.*?>.*?[^<>]</ref>|<ref.*?/>|<ref>.*?[^<>]</ref>')
mref = ref_re.findall(source)
print('found', len(mref), 'references.')
for ref in mref:
    print(ref)

found 54 references.
<ref name="University History">{{cite web|url=http://publicrelations.uncc.edu/information-media-kit/university-history|title=University History - Office of News and Information - UNC Charlotte|website=publicrelations.uncc.edu|access-date=May 11, 2017}}</ref>
<ref>As of June 13, 2020. {{cite web| title = National Association of College and University Business Officers| work = U.S. and Canadian Institutions Listed by Fiscal Year (FY) 2019 Endowment Market Value and Change* in Endowment Market Value from FY2018 to FY2019| url =https://www.nacubo.org/-/media/Nacubo/Documents/EndowmentFiles/2019-Endowment-Market-Values--Final-Feb-10.ashx?la=en&hash=E71088CDC05C76FCA30072DA109F91BBC10B0290| format = PDF| access-date = June 13, 2020 }}</ref>
<ref>{{cite web |url=http://publicrelations.uncc.edu/sites/publicrelations.uncc.edu/files/media/factsheet_November2013.pdf |title=Archived copy |access-date=2014-08-05 |url-status=dead |archive-url=https://web.archive.org/web/20140729

Remove all references from the source string.

In [165]:
source = ref_re.sub('', source)
print(source)

{{Use mdy dates|date=October 2011}}
{{Infobox university
| name = The University of North Carolina at Charlotte
| native_name = 
| latin_name = 
| image = UNC Charlotte seal.png
| image_upright = .7
| motto = 
| established = {{start date and age|1946}}
| type = [[Public university|Public]]
| parent = [[University of North Carolina|UNC System]]
| endowment = $230.35 million (2019)
| staff = 
| faculty = 1,456
| president = 
| provost = Joan Lorden
| principal = 
| rector = 
| chancellor = Sharon Gaber
| vice_chancellor = Kevin Bailey
| dean = Christine Reed Davis
| head_label = 
| head = 
| students = 30,146 (Fall 2020)
| undergrad = 24,175 (Fall 2020)
| postgrad = 5,971 (Fall 2020)
| city = [[Charlotte, North Carolina|Charlotte]]
| state = [[North Carolina]]
| country = United States
| campus = [[University City (Charlotte neighborhood)|University City]]<br />{{convert|1000|acre|km2|1|abbr=on}}
| former_names = Charlotte Center of the University of North Carolina (1946–1949)<br />Char

**Task 3.b** Replace all piped links [[string1|string2]] and [[string2]] with the surface string string2.

In [166]:
pip_re1 = re.compile(r'\[\[[^\[\]\|:]*?\|([^\[\]\|]*?)\]\]')
pip_re2 = re.compile(r'\[\[([^\[\]\|:]*?)\]\]')
mpip = pip_re1.findall(source1)
print('found', len(mpip), 'piped links.')
for pip in mpip:
    print(pip)
#     #source
#     if pip[0]:
#         print(pip[0])
#         m.append(pip[0])
#     else:
#         print(pip[1])
#         m.append(pip[1])
# for n in m:
#     source1 = pip_re1.sub('', source1, 1)
#print(m)
source = pip_re1.sub('\\1', source)
source = pip_re2.sub('\\1', source)
print(source)

found 0 piped links.
{{Use mdy dates|date=October 2011}}
{{Infobox university
| name = The University of North Carolina at Charlotte
| native_name = 
| latin_name = 
| image = UNC Charlotte seal.png
| image_upright = .7
| motto = 
| established = {{start date and age|1946}}
| type = Public
| parent = UNC System
| endowment = $230.35 million (2019)
| staff = 
| faculty = 1,456
| president = 
| provost = Joan Lorden
| principal = 
| rector = 
| chancellor = Sharon Gaber
| vice_chancellor = Kevin Bailey
| dean = Christine Reed Davis
| head_label = 
| head = 
| students = 30,146 (Fall 2020)
| undergrad = 24,175 (Fall 2020)
| postgrad = 5,971 (Fall 2020)
| city = Charlotte
| state = North Carolina
| country = United States
| campus = University City<br />{{convert|1000|acre|km2|1|abbr=on}}
| former_names = Charlotte Center of the University of North Carolina (1946–1949)<br />Charlotte College (1949–1965)
| free_label = 
| free = 
| athletics = NCAA Division I – C-USA
| sports = 18 varsity s

**Task 3.c** Design a regular expression file_re that matches file strings of the type *[[File: ...]]*. Use the regular expression to remove all file strings from the source.

In [167]:
file_re = re.compile(r'\[\[File:(.*)\]\]')
mfile = file_re.findall(source)
print('found', len(mfile), 'file links.')
for file in mfile:
    print(file)

source = file_re.sub('', source)
print(source)

found 6 file links.
Ucity.jpg|thumb|250px|Aerial view of UNC Charlotte
Conegrave.jpg|thumb|275px|right|Bonnie Cone's final resting place on the campus of UNC Charlotte, with Cato Hall and Fretwell Hall in the background. Also thought to be the meeting place of Diu Memoriae Consilium.
UNCCNewQuad.jpg|302x302px|thumb|right|This quad-style area was completed in 2007 with the completion of the College of Health and Human Services (left) and the Cato College of Education (right).
UNC Charlotte Center City Campus.jpg|alt=UNC Charlotte's Center City Campus|left|thumb|UNC Charlotte's Center City Campus is located on 9th Street in Uptown Charlotte. The building is home to a number of graduate-level programs in order to meet the needs of working professionals in the second largest financial city in America.
Belktower.jpg|233x233px|thumb|The Carillon and J. Murrey Atkins Library entrance on UNC Charlotte's main campus (left) and the Belk Tower (middle), which was torn down in 2016
JRS Entrance 3.

**Task 3.d** Use a regular expression to remove all category links from the source.

In [168]:
categ_re = re.compile(r'\[\[Category:(.*)\]\]')
mc = categ_re.findall(source)
print('Found', len(mc), 'categories.')
print(mc)

source = categ_re.sub('', source)
print(source)

Found 6 categories.
['University of North Carolina at Charlotte| ', 'Educational institutions established in 1946', 'Universities and colleges in Charlotte, North Carolina', 'Public universities and colleges in North Carolina|University of North Carolina at Charlotte', 'Universities and colleges accredited by the Southern Association of Colleges and Schools', '1946 establishments in North Carolina']
{{Use mdy dates|date=October 2011}}
{{Infobox university
| name = The University of North Carolina at Charlotte
| native_name = 
| latin_name = 
| image = UNC Charlotte seal.png
| image_upright = .7
| motto = 
| established = {{start date and age|1946}}
| type = Public
| parent = UNC System
| endowment = $230.35 million (2019)
| staff = 
| faculty = 1,456
| president = 
| provost = Joan Lorden
| principal = 
| rector = 
| chancellor = Sharon Gaber
| vice_chancellor = Kevin Bailey
| dean = Christine Reed Davis
| head_label = 
| head = 
| students = 30,146 (Fall 2020)
| undergrad = 24,175 (Fa

**Task 3.e** *Mandatory for graduate students, optional (bonus points) for undergraduate students*

- Remove all templates and infoboxes from the source document.
    - These are any strings of the type '{{ ... }}'
    - Beware that there can be multiple levels of nesting, e.g. '{{ ... {{ .. {{ .... }} .. }} ... }}'. This cannot be matched with regular expressions (explain why).

In [169]:
def remove_templates(s):
    #🥺
    # formal languages cannot represent the non deterministic repeating with the 
    stack = []
    



source = remove_templates(source)
print(source)

Found 0 templates.
Found 0 templates.
{{Use mdy dates|date=October 2011}}
{{Infobox university
| name = The University of North Carolina at Charlotte
| native_name = 
| latin_name = 
| image = UNC Charlotte seal.png
| image_upright = .7
| motto = 
| established = {{start date and age|1946}}
| type = Public
| parent = UNC System
| endowment = $230.35 million (2019)
| staff = 
| faculty = 1,456
| president = 
| provost = Joan Lorden
| principal = 
| rector = 
| chancellor = Sharon Gaber
| vice_chancellor = Kevin Bailey
| dean = Christine Reed Davis
| head_label = 
| head = 
| students = 30,146 (Fall 2020)
| undergrad = 24,175 (Fall 2020)
| postgrad = 5,971 (Fall 2020)
| city = Charlotte
| state = North Carolina
| country = United States
| campus = University City<br />{{convert|1000|acre|km2|1|abbr=on}}
| former_names = Charlotte Center of the University of North Carolina (1946–1949)<br />Charlotte College (1949–1965)
| free_label = 
| free = 
| athletics = NCAA Division I – C-USA
| spor

**Task 3.f** *Mandatory for graduate students, optional (bonus points) for undergraduate students*

Design a regular expression that finds all occurences of integer numbers in an input strings and uses a substitution to replace them with the equivalent real numbers by appending '.0' to them. Use it to implement the function `realize(s)` below.

In [157]:
def realize(s):
    # YOUR CODE HERE
    reg = re.compile(r'[ -](\d+)(?!\.\d)')
    mc = reg.sub(' '+'\\1'+'.0',s)
    return mc
    
# This should print 'When we add 4.0 to 1.5 and 0.5 to -2.5 we end up with 5.5 and -2.0.'
print(realize('When we add 4 to 1.5 and 0.5 to -2.5 we end up with 5.5 and -2.'))

When we add 4.0 to 1.5 and 0.5 to -2.5 we end up with 5.5 and  2.0.


**Task 3.g [Bonus points]**

Design a search-and-replace function based on the `sub()` method for regular expressions that finds all occurences of integer numbers in an input strings and replaces them with an incremented version. Use it to implement the function `increment(s)` below.

*Hint: Read the <a href="https://docs.python.org/3/howto/regex.html">documentation</a> on the `sub()` function to see how to use it with a function argument.*

In [156]:
def increment(s):
    # YOUR CODE HERE
    reg = re.compile(r'[ -](\d+)')
    numbers = reg.findall(s)
    return  
    
# This should print 'Eve has 6 apples. She gives 4 to Adam and the remaining 3 to the snake.'
print(increment('Eve has 5 apples. She gives 3 to Adam and the remaining 2 to the snake.'))

None


**Task 4 [Bonus points]** Anything extra goes here.

In [125]:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Eve has 5 apples. She gives 3 to Adam and the remaining 2 to the snake.")
for sent in doc.sents:
    num0_m = re.search(r'\d+', sent)                  # Extract the first chunk of 1+ digits
    if num0_m:                                      # If there is a match
        rx = r'(?<!\d){}(?!\d)'.format(num0_m.group())  # Set a regex to match the number when not inside other digits
        print(re.sub(rx, lambda x: str(int(x.group())+1), sent))
#    print(sent)


TypeError: expected string or bytes-like object

In [121]:
a1 = 'Eve has 5 apples. She gives 3 to Adam and the remaining 2 to the snake.'
num0_m = re.search(r'\d+', a1)                  # Extract the first chunk of 1+ digits
if num0_m:                                      # If there is a match
    rx = r'(?<!\d){}(?!\d)'.format(num0_m.group())  # Set a regex to match the number when not inside other digits
    print(re.sub(rx, lambda x: str(int(x.group())+1), a1))

Eve has 6 apples. She gives 3 to Adam and the remaining 2 to the snake.


In [133]:
a1 = 'Eve has 5 apples. She gives 3 to Adam and the remaining 2 to the snake.'
sents = a1.split('.')
for sent in sents:
    print(sent)
    num0_m = re.search(r'\d+', sent)                  # Extract the first chunk of 1+ digits
    if num0_m:                                      # If there is a match
        rx =re.compile(r'(?<!\d){}(?!\d)',re.VERBOSE)
        # Set a regex to match the number when not inside other digits)
        ints = re.findall(sent)
        print(re.sub(rx, lambda x: str(int(x.group())+1), sent))

Eve has 5 apples


TypeError: findall() missing 1 required positional argument: 'string'