# RWET : Regular Expressions - Mar 24

## Escape sequences in Python

How can we put `"` and `'` characters inside a string without *breaking* Python? Escape them with a backslash (`\`)!

In [1]:
print("And then she said, \"I'm really interested in computer programming\"")

And then she said, "I'm really interested in computer programming"


Also, some special characters can only be typed by prepending a `\`.
- `\n` makes a new line
- `\t` makes a tab character
- `\\` makes an actual backslash (needed whenever the backslash would otherwise start an escape sequence like `\n`)

In [2]:
print("two\ttabbed\nlines\there")

two	tabbed
lines	here


In [3]:
print("let's show\nthe new line\ncharacter \\n")

let's show
the new line
character \n


In [4]:
print("here's two backslashes: \\\\ nice two backslashes!")

here's two backslashes: \\ nice two backslashes!


## [regular expressions](https://en.wikipedia.org/wiki/Regular_expression)


In [10]:
input_str = "here's a zip code: 11234. this is not a zip code: 345. But this is another zip code: 32109. nicely done! oh, and note that 114453 is not a zip code and 11342-1123 is an extended zip+4 code"

How could we extract only the zip codes from the previous string?

Let's think about the steps...
- iterate over each character and look for digits
- keep track of how many digits in a row we've seen
- when the run ends (the next character is not a digit), keep it only if it is exactly 5 digits long

In [11]:
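current = ""

for c in input_str:
    if c.isdigit():
        current += c
    else:
        # a run of digits just ended: keep it only if it is exactly 5 long
        if len(current) == 5:
            print(current)
        current = ""

# don't forget a run that reaches the very end of the string
if len(current) == 5:
    print(current)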

11234
32109
11342


That was not that complex (thankfully, since the example was picked to be easy), but if we want to find zip+4 codes (11215-3544)... well, that gets ugly fast.

But, wait! There are regular expressions for that! (well, duh)

`re` is Python's built-in module for regex

In [12]:
import re
re.findall(r"\b\d{5}\b",input_str)

['11234', '32109', '11342']
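The zip+4 case that looked painful with a hand-written loop turns out to be a small change to the pattern. This is a minimal sketch, not part of the original session: the names `zip4_only` and `zip_or_zip4` are made up for illustration, and it borrows `?` and `(?:...)`, which are covered in the quantifiers and alternation sections further down.

```python
import re

input_str = "here's a zip code: 11234. this is not a zip code: 345. But this is another zip code: 32109. nicely done! oh, and note that 114453 is not a zip code and 11342-1123 is an extended zip+4 code"

# exactly five digits, a dash, exactly four digits
zip4_only = re.findall(r"\b\d{5}-\d{4}\b", input_str)
print(zip4_only)      # ['11342-1123']

# five digits, optionally followed by a dash and four more digits
zip_or_zip4 = re.findall(r"\b\d{5}(?:-\d{4})?\b", input_str)
print(zip_or_zip4)    # ['11234', '32109', '11342-1123']
```

Note that the plain `\b\d{5}\b` pattern reports `11342` on its own because `-` counts as a word boundary; the second pattern keeps the extension attached.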

In [13]:
# read one subject per line; the with-block closes the file when done
with open("sources/enronsubjects.txt") as f:
    subjects = [line.strip() for line in f]

In [15]:
import random as rng

print(len(subjects))
rng.sample(subjects, 5)

176825


['RE:',
 'Genesis Park',
 'What happened on Monday',
 'FW: Call to SCE Regarding Negative CTC Credits',
 'ISDA Master Agreement and Credit Support Annex']

In [16]:
[item for item in subjects if "shopping" in item]

["FW: We've been shopping!",
 'Re: Start shopping...',
 'Start shopping...',
 'FW: Online shopping',
 'Online shopping']

In [17]:
[item for item in subjects if "shipping" in item]

['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping']

But what if we want BOTH lists (in one line, of course)? In a regex, `.` stands for any single character, so `sh.pping` matches both words.

In [19]:
[item for item in subjects if re.search("sh.pping", item)]

['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 "FW: We've been shopping!",
 'Re: Start shopping...',
 'Start shopping...',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'FW: Online shopping',
 'Online shopping']

### metacharacters

Characters that have a special meaning when used in regular expressions:

- `.`  : any single character (except a newline, by default)
- `\w` : any "word" character: letters, digits and underscore (a-z, A-Z, 0-9, _ )
- `\s` : any whitespace character (space, tab, newline)
- `\S` : any non-whitespace character
- `\d` : any digit (0-9)
- `\.` : an actual period
- `^` : negation ("not"), when it is the first character inside a character class such as `[^aeiou]`
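To get a feel for the list above, here is a small sketch on a made-up string (`sample` is not from the Enron data, it is only an illustration):

```python
import re

sample = "Room 42, at 3.5pm"

print(re.findall(r"\d", sample))    # ['4', '2', '3', '5']
print(re.findall(r"\s", sample))    # [' ', ' ', ' ']
print(re.findall(r"\w+", sample))   # ['Room', '42', 'at', '3', '5pm']
print(re.findall(r"\S+", sample))   # ['Room', '42,', 'at', '3.5pm']

# an unescaped . matches ANY character, an escaped \. only a real period
print(re.findall(r"3.5", "3.5 3x5 345"))    # ['3.5', '3x5', '345']
print(re.findall(r"3\.5", "3.5 3x5 345"))   # ['3.5']
```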

Python has its own rules for what a `\` means inside a string, but regex is its own little language with its own backslash sequences. So we write the pattern as `r""` (the `r` stands for "raw"): Python then leaves every `\` alone and passes it straight to the regex engine.

In [20]:
[item for item in subjects if re.search(r"\d:\d\d\wm", item)]

['RE: 3:17pm',
 '3:17pm',
 "RE: It's On!!! - 2:00pm Today",
 "FW: It's On!!! - 2:00pm Today",
 "It's On!!! - 2:00pm Today",
 'Re: Registration Confirmation: Larry Summers on 12/6 at 1:45pm (was',
 'Re: Conference Call today 2/9/01 at 11:15am PST',
 'Conference Call today 2/9/01 at 11:15am PST',
 '5/24 1:00pm conference call.',
 '5/24 1:00pm conference call.',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 '07:33am EDT 15-Aug-01 Prudential Securities (C',
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Updated Mar'00 Requirements Received at 11:25am from CES",
 'Reminder: Legal Team Meeting -- Friday, 9:00am Houston time',
 'Thursday, March 7th 1:30-3:00pm: REORIENTATION',
 'Meeting at 2:00pm Friday',
 'Meeting at 2:00pm Friday',
 'Fw: 12:30pm Deadline for changes to letters or

In [21]:
print(r"any backslash like this one: \n is interpreted as a backslash -usually-")

any backslash like this one: \n is interpreted as a backslash -usually-


In [22]:
# it is really useful to type things like...
print(r"here's how you type a new line in a string: \n :D")

here's how you type a new line in a string: \n :D
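Why this matters for regex in particular: some sequences mean one thing to Python and another to `re`. A short sketch with `\b` (a backspace character for Python, a word boundary for regex), using a made-up test sentence:

```python
import re

# without the r, Python turns \b into a single backspace character
# before the re module ever sees the pattern
print(len("\b"), len(r"\b"))                    # 1 2

print(re.findall("\bcat\b", "the cat sat"))     # []
print(re.findall(r"\bcat\b", "the cat sat"))    # ['cat']
```

The first search silently finds nothing, because it is looking for backspace + `cat` + backspace. Always writing patterns as raw strings avoids having to remember which sequences collide.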


### character classes

- `[aeiou]` : vowels
- `[02468]` : even digits
- `[Ee]` : either `e` or `E`
- `[a-z]` : any lowercase letter
- `[^aeiou]` : any character that is NOT a lowercase vowel (consonants, but also digits, spaces, punctuation...)
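A quick sketch of each class on tiny made-up strings (none of these inputs come from the Enron file):

```python
import re

print(re.findall(r"[aeiou]", "Regex"))       # ['e', 'e']
print(re.findall(r"[^aeiou]", "Regex"))      # ['R', 'g', 'x']
print(re.findall(r"[02468]", "12345678"))    # ['2', '4', '6', '8']
print(re.findall(r"[a-z]", "Ab1c"))          # ['b', 'c']
print(re.findall(r"[Ee]", "Eerie"))          # ['E', 'e', 'e']

# careful: a negated class matches anything else, not just letters
print(re.findall(r"[^aeiou]", "a 1!"))       # [' ', '1', '!']
```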

In [23]:
[item for item in subjects if re.search(r"[aeiou][aeiou][aeiou][aeiou]", item)]

['Re: Natural gas quote for Louiisiana-Pacific (L-P)',
 'WooooooHoooooo more Vacation',
 'Re: Clickpaper Counterparties waiting to clear the work queue',
 'Gooooooooooood Bye!',
 'Gooooooooooood Bye!',
 'RE: Hello Sweeeeetie',
 'Hello Sweeeeetie',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'Re: FW: Wasss Uuuuuup STG?',
 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'FW: The Osama Bin Laden Song ( Soooo Funny !! )',
 'Fw: The Osama Bin Laden Song ( Soooo Funny !! )',
 'The Osama Bin Laden Song ( Soooo Funny !! )',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: FPL Queue positions 1-15',
 'Re: FPL Queue positions 1-15',
 'Re: Helloooooo!!!',
 'Re: Helloooooo!!!',
 'Fw: FW: OOOooooops',
 'FW: FW: OOOooooops',
 'Re: yeeeeha',
 'yeeeeha',
 'yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo

In [24]:
[item for item in subjects if re.search(r"[Uu]niversity", item)]

['Fw: Big 12 Conference, University of Texas, Document 1629_108',
 'Big 12 Conference, University of Texas, Document 1629_108 Football',
 'Re: The Next HEISMAN winner for the University of Texas',
 'Re: The Next HEISMAN winner for the University of Texas',
 'The Next HEISMAN winner for the University of Texas',
 'The Next HEISMAN winner for the University of Texas',
 'RE: Information from University of Colorado',
 'FW: Information from University of Colorado',
 'RE: Information from University of Colorado',
 'RE: Information from University of Colorado',
 'FW: Information from University of Colorado',
 'RE: Information from University of Colorado',
 'RE: Information from University of Colorado',
 'FW: Information from University of Colorado',
 'RE: Information from University of Colorado',
 'RE: Information from University of Colorado',
 'FW: Information from University of Colorado',
 'FW: University of California et al. v. EES',
 'University of California et al. v. EES',
 'University 

### anchors

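- `^` beginning of the string (of each line only with the `re.MULTILINE` flag)
- `$` end of the string (of each line only with the `re.MULTILINE` flag)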
- `\b` word boundary
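Here is a small sketch of what each anchor changes, on a hand-made list (`demo` and the result names are invented for this example, they are not taken from the subjects file):

```python
import re

demo = ["New York Hotel", "I love New York", "Oil prices", "Husky Oil", "Boiling point"]

starts = [s for s in demo if re.search(r"^New York", s)]
ends = [s for s in demo if re.search(r"New York$", s)]
whole_word = [s for s in demo if re.search(r"\b[Oo]il\b", s)]
anywhere = [s for s in demo if re.search(r"[Oo]il", s)]

print(starts)      # ['New York Hotel']
print(ends)        # ['I love New York']
print(whole_word)  # ['Oil prices', 'Husky Oil']
print(anywhere)    # ['Oil prices', 'Husky Oil', 'Boiling point']
```

Without `\b`, the `oil` hiding inside "Boiling" counts as a match too.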

In [25]:
[item for item in subjects if re.search(r"^New York", item)]

['New York Details',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York',
 'New York',
 'New York',
 'New York, etc.',
 'New York, etc.',
 'New York sites',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York',
 'New York',
 'New York City Marathon Guaranteed Entry',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas ("NYSEG")',
 'New York regulatory restriccions',
 'New York regulatory restriccions',
 'New York Bar Numbers']

In [26]:
[item for item in subjects if re.search(r"\.\.\.$", item)]

['Re: Inquiry....',
 'Re: Inquiry....',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'Re: Hmmmmm........',
 'Hmmmmm........',
 'FW: Bumping into the husband....',
 'FW: Bumping into the husband....',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 '

In [28]:
[item for item in subjects if re.search(r"\b[Oo]il\b$", item)]

['B & J Gas and Oil',
 'Re: B & J Gas and Oil',
 'Re: B & J Gas & Oil',
 'Re: B & J Gas & Oil',
 'B & J Gas & Oil',
 'Re: B & J Gas & Oil',
 'Re: B & J Gas & Oil',
 'B & J Gas & Oil',
 'Husky Oil',
 'FW: GPCM News: 8/20/01:  RBAC Finalizing Schedule in Houston: Oil',
 'FW: Pioneer Oil',
 'Pioneer Oil',
 'RE: Cutter Oil',
 'Cutter Oil',
 'RE: Cutter Oil',
 'RE: Cutter Oil',
 'RE: Cutter Oil',
 'Cutter Oil',
 'Cross Timbers Oil',
 'Cutter Oil',
 'Murphy Oil',
 'Murphy Oil',
 'RE: DeBrular/Stevens Oil',
 'RE: DeBrular/Stevens Oil',
 'DeBrular/Stevens Oil',
 'RE: Stevens Oil',
 'RE: Stevens Oil',
 'Stevens Oil',
 'RE: DeBrular/Stevens Oil',
 'DeBrular/Stevens Oil',
 'Re: Colonial Oil',
 'Colonial Oil',
 'Wall Street Journal Article - Regarding SPR Oil',
 'FW: US Filter Comments on Omnibus and Annex A of a Heating Oil',
 'Fuel Oil',
 'Fuel Oil',
 'Re: ISDA for Irving Oil',
 'Ft. Pierce #2 Fuel Oil',
 'FW: Kennedy Oil',
 'Kennedy Oil',
 'Kennedy Oil',
 'FW: Kennedy Oil',
 'Kennedy Oil',
 'Ke

### quantifiers!

- `{n}` : exactly n times
- `{n,}` : matches at least n times
- `+` : match at least once
- `*` : match zero or more times
- `?` : match zero or one times
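Each quantifier applies to whatever comes right before it. A minimal sketch on made-up strings:

```python
import re

print(re.findall(r"o{2}", "yahoooo"))           # ['oo', 'oo']
print(re.findall(r"o{3,}", "yahoooo"))          # ['oooo']
print(re.findall(r"o+", "foo boo o"))           # ['oo', 'oo', 'o']
print(re.findall(r"ab*", "a ab abbb"))          # ['a', 'ab', 'abbb']
print(re.findall(r"colou?r", "color colour"))   # ['color', 'colour']
```

Notice that `o{2}` finds two separate matches inside a run of four: `{n}` means exactly n repetitions per match, but nothing stops a longer run from containing that match.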

In [29]:
# search for at least 15 uppercase letters in a row ({15} is enough: any longer run contains a run of 15)
[item for item in subjects if re.search(r"[A-Z]{15}", item)]

['CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: FW: Fw: Fw: Fw: Fw: Fw: Fw: PLEEEEEEEEEEEEEEEASE READ!',
 'ACCOMPLISHMENTS',
 'ACCOMPLISHMENTS',
 'Re: FW: FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'Re: CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: ORDER ACKNOWLEDGEMENT',
 'ORDER ACKNOWLEDGEMENT',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: CONGRATULATIONS !',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'Re: FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAA

In [31]:
# look for (things that look like) email addresses
[item for item in subjects if re.search(r"\w+@\w+\.\w\w+", item)]

['Your Approval is Overdue: Access Request for paul.t.lucci@enron.com',
 'Your Approval is Overdue: Access Request for paul.t.lucci@enron.com',
 'Your Approval is Overdue: Access Request for paul.t.lucci@enron.com',
 'Request Submitted: Access Request for frank.ermis@enron.com',
 'Request Submitted: Access Request for frank.ermis@enron.com',
 'Your Approval is Overdue: Access Request for mike.grigsby@enron.com',
 'Your Approval is Overdue: Access Request for mike.grigsby@enron.com',
 'Your Approval is Overdue: Access Request for barry.tycholiz@enron.com',
 'unsubscribe don.baughman@enron.com',
 'Re: Request Submitted: Access Request for shona.wilson@enron.com',
 'Request Submitted: Access Request for shona.wilson@enron.com',
 'Request Submitted: Access Request for bob.m.hall@enron.com',
 'Request Submitted: Access Request for bob.m.hall@enron.com',
 'Re: Request Submitted: Access Request for mog.heu@enron.com',
 'Re: Request Submitted: Access Request for mog.heu@enron.com',
 'Request S

In [33]:
# find forwarded emails
[item for item in subjects if re.search(r"^F[Ww][Dd]?", item)]

['FW: ALL 1099 TAX QUESTIONS - ANSWERED',
 'FW: ALL 1099 TAX QUESTIONS - ANSWERED',
 'FW: Cross Commodity',
 'FW: Cross Commodity',
 'FW: fixed forward or other Collar floor gas price terms',
 'FW: fixed forward or other Collar floor gas price terms',
 'FW: charts',
 'FW: charts',
 'FW: Bishops Corner',
 'FW: Western Wholesale Activities - Gas & Power Conf. Call',
 'FW: Western Wholesale Activities - Gas & Power Conf. Call',
 'FW: charts',
 'FW: NEWGen June Release',
 'FW: Crossroads Storage Project',
 'FW: Crossroads Storage Project',
 'FW: Meeting to discuss West gas desk "FERC messages"',
 'FW:',
 'FW:',
 'FW: The Stage',
 'FW: Goldman Comment re: Enron issued this morning - Revised Price',
 'FW: California gas intrastate matters',
 'FW: El Paso Announces Binding Open Season for Additional Capacity',
 'FW: California gas intrastate matters - July 11 conference call',
 'FW: West Power Strategy Briefing',
 'FW:',
 'FW: Party',
 'FW: CA Instrate Gas matters',
 'FW: American Express Let

### alternation!

- `(?:x|y)` : matches either x OR y (the `(?: )` parentheses group the alternatives without capturing them)
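The parentheses matter because `|` splits the whole pattern, anchors included. A small sketch on a made-up list (`demo`, `grouped` and `ungrouped` are names invented for this example):

```python
import re

demo = ["RE: hi", "Fwd: hi", "Hello FW"]

# grouped: ^ applies to both alternatives
grouped = [s for s in demo if re.search(r"^(?:R[Ee]|F[Ww][Dd]?)", s)]

# ungrouped: this reads as (^R[Ee]) OR (F[Ww][Dd]? anywhere in the string)
ungrouped = [s for s in demo if re.search(r"^R[Ee]|F[Ww][Dd]?", s)]

print(grouped)     # ['RE: hi', 'Fwd: hi']
print(ungrouped)   # ['RE: hi', 'Fwd: hi', 'Hello FW']
```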

In [44]:
# find all replies or forwards

re_or_fwd = [item for item in subjects if re.search(r"^(?:R[Ee]|F[Ww][Dd]?)", item)]

print("there are", len(re_or_fwd), "replies or forwarded emails, out of", len(subjects), "emails.\nThat's a fraction of", len(re_or_fwd)/len(subjects))

there are 98185 replies or forwarded emails, out of 176825 emails.
That's a fraction of 0.5552665064329139


### capturing what matches

In [45]:
alphabet = "alpha beta gamma delta epsilon zeta eta theta"

In [46]:
# let's find all words that are 5 characters long (with regex, of course)

re.findall(r"\b\w{5}\b", alphabet)

['alpha', 'gamma', 'delta', 'theta']

Let's see what happens if we just have one HUGE string instead of a list... and use regex on it!

In [48]:
# read the whole file as one string; the with-block closes the file when done
with open("sources/enronsubjects.txt") as f:
    all_subjects = f.read()

In [49]:
# let's find all the zip codes! (really: every standalone 5-digit number)

re.findall(r"\b\d{5}\b", all_subjects)

['20267',
 '20267',
 '20267',
 '20267',
 '15956',
 '22069',
 '22069',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '27291',
 '35842',
 '35842',
 '35842',
 '35842',
 '27291',
 '27291',
 '27291',
 '25672',
 '27190',
 '25672',
 '27190',
 '25672',
 '27190',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '22585',
 '11781',
 '11781',
 '78158',
 '78158',
 '21349',
 '21349',
 '40387',
 '40387',
 '19818',
 '10000',
 '10000',
 '80110',
 '39474',
 '30643',
 '30643',
 '30643',
 '24690',
 '24690',
 '24690',
 '24690',
 '26532',
 '78033',
 '78032',
 '78033',
 '78032',
 '92886',
 '92886',
 '96731',
 '96731',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '34342',
 '93481',
 '93481',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',
 '93836',


In [51]:
# or let's find all .com domains!

com_domains = re.findall(r"\b\w+\.com", all_subjects)

In [52]:
from collections import Counter

Counter(com_domains)

Counter({'1400smith.com': 2,
         'ABCNEWS.com': 16,
         'Abestkitchen.com': 2,
         'Agency.com': 9,
         'Alamo.com': 4,
         'Amazon.com': 12,
         'Anywhere.com': 1,
         'ArdorNY.com': 5,
         'BIGWORDS.com': 1,
         'BadMojo09092hotmail.com': 1,
         'Bid4me.com': 2,
         'Blair.com': 2,
         'Braodcast.com': 1,
         'Broker.com': 3,
         'CERA.com': 3,
         'CVS.com': 2,
         'CapacityCenter.com': 2,
         'CareerPath.com': 17,
         'Center.com': 2,
         'Chematch.com': 1,
         'ClickPaper.com': 10,
         'Clickpaper.com': 5,
         'Colonize.com': 1,
         'CommodityLogic.com': 2,
         'Compaq.com': 3,
         'Concierge.com': 4,
         'Cortlandtwines.com': 2,
         'Credit.com': 5,
         'Credit2B.com': 4,
         'DefensiveDriver.com': 2,
         'Dictionary.com': 3,
         'ESPN.com': 4,
         'Edmunds.com': 1,
         'EnergyGateway.com': 2,
         'EnergyPrism.co

In [53]:
Counter(com_domains).most_common()

[('enron.com', 234),
 ('EnronCredit.com', 42),
 ('NYTimes.com', 22),
 ('CareerPath.com', 17),
 ('Nodocero.com', 17),
 ('ABCNEWS.com', 16),
 ('Match.com', 13),
 ('Amazon.com', 12),
 ('ubsenergy.com', 11),
 ('taxclaity.com', 10),
 ('ClickPaper.com', 10),
 ('ubswenergy.com', 9),
 ('Agency.com', 9),
 ('har.com', 8),
 ('FT.com', 8),
 ('UBSWenergy.com', 8),
 ('MarketWatch.com', 7),
 ('risk.com', 7),
 ('southwest.com', 6),
 ('clanpages.com', 6),
 ('reactionsnet.com', 6),
 ('Omaha.com', 6),
 ('HoustonChronicle.com', 5),
 ('ArdorNY.com', 5),
 ('Clickpaper.com', 5),
 ('Credit.com', 5),
 ('Enroncredit.com', 5),
 ('ScottPaul.com', 5),
 ('EnronOnline.com', 5),
 ('Concierge.com', 4),
 ('washingtonpost.com', 4),
 ('ESPN.com', 4),
 ('Alamo.com', 4),
 ('RedMeteor.com', 4),
 ('Credit2B.com', 4),
 ('insiderSCORES.com', 4),
 ('Expedia.com', 4),
 ('Travelocity.com', 4),
 ('Quicken.com', 4),
 ('EnergyPrism.com', 4),
 ('edftrading.com', 4),
 ('WeatherMarkets.com', 4),
 ('FitRx.com', 4),
 ('merckmedco.com', 3

In [55]:
# to remove duplicates

unique_domains = list(set(com_domains))
unique_domains

['hannaandersson.com',
 'ArdorNY.com',
 'al.com',
 'iWon.com',
 'BIGWORDS.com',
 'alan.com',
 'HoustonStreet.com',
 'HoustonChronicle.com',
 'Credit2B.com',
 'myuhc.com',
 'Travelocity.com',
 'ZanyBrainy.com',
 'MyFamily.com',
 'RedMeteor.com',
 'merckmedco.com',
 'Paper.com',
 'Amazon.com',
 'enerfax.com',
 'brcepat.com',
 'SmartMoney.com',
 'Cortlandtwines.com',
 'U2.com',
 'u2.com',
 'Agency.com',
 'boxmind.com',
 'Fingerhut.com',
 'clanpages.com',
 'InfrastructureWorld.com',
 'StudentMagazine.com',
 'nexant.com',
 'ubswenergy.com',
 'Concierge.com',
 'INSIDER.com',
 'ft.com',
 'PrimeShot.com',
 'washingtonpost.com',
 'CommodityLogic.com',
 'insiderSCORES.com',
 'continental.com',
 'enron.com',
 'Credit.com',
 'reactionsnet.com',
 'turnonthetruth.com',
 'FitRx.com',
 'Nice.com',
 'Markets.com',
 'flashyourrack.com',
 'Individual.com',
 'Industrialinfo.com',
 'Ticketmaster.com',
 'har.com',
 'Grassy.com',
 'educationplanet.com',
 'southwest.com',
 'WeatherMarkets.com',
 '1400smith.co
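Since `set()` compares strings exactly, `U2.com` and `u2.com` both survive the de-duplication above. One possible way around it, sketched here under the assumption that case does not matter for domains, is to lowercase everything first (`sample_domains` is a made-up stand-in for `com_domains`):

```python
from collections import Counter

# a made-up sample standing in for com_domains
sample_domains = ["U2.com", "u2.com", "ClickPaper.com", "Clickpaper.com", "enron.com"]

# lowercase before building the set, so spelling variants collapse
unique_lower = sorted(set(d.lower() for d in sample_domains))
print(unique_lower)        # ['clickpaper.com', 'enron.com', 'u2.com']

# the same trick merges the counts
counts = Counter(d.lower() for d in sample_domains)
print(counts["u2.com"])    # 2
```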

In [59]:
# if we want to know the mentions of New York and the word after...

set(re.findall(r"\bNew York \b\w+\b",all_subjects))

{'New York Bar',
 'New York Branch',
 'New York City',
 'New York Details',
 'New York Energy',
 'New York Hotel',
 'New York Inc',
 'New York Mercantile',
 'New York Office',
 'New York Power',
 'New York State',
 'New York Times',
 'New York on',
 'New York regulatory',
 'New York sites',
 'New York voice'}

In [58]:
# but if we only want the following word, not the "New York" bit...
# when the pattern contains a group (), findall returns only what the group matched!

list(set(re.findall(r"\bNew York (\b\w+\b)",all_subjects)))

['Times',
 'Energy',
 'voice',
 'Inc',
 'Details',
 'Power',
 'regulatory',
 'Mercantile',
 'Branch',
 'sites',
 'on',
 'State',
 'City',
 'Office',
 'Bar',
 'Hotel']
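Groups also work with `re.search`, and a pattern can have more than one. A minimal sketch on made-up strings (the sentences and the names `m` and `pairs` are only for illustration):

```python
import re

# re.search returns a match object: group(0) is the whole match,
# group(1) is what the first pair of parentheses captured
m = re.search(r"\bNew York (\w+)", "Visit the New York Hotel today")
print(m.group(0))   # New York Hotel
print(m.group(1))   # Hotel

# with two groups, findall returns one tuple per match
pairs = re.findall(r"(\w+)@(\w+)\.com", "frank.ermis@enron.com and bob@example.com")
print(pairs)        # [('ermis', 'enron'), ('bob', 'example')]
```

The first tuple starts at `ermis` rather than `frank` because `\w` does not match the period in `frank.ermis`.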

In [None]:
# dollar amounts, with an optional k/m/b suffix ("$500", "$20M", "$5 k")

dollars = re.findall(r"\$\d+(?: ?[kmbKMB]\b)?", all_subjects)
print(dollars)
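How much a single space in the pattern matters can be checked on a made-up sentence. This is only a sketch: `sample` and both result names are invented for illustration, and the second pattern is one possible variant rather than the only reasonable one.

```python
import re

sample = "raised $500 k, then $20M and finally $3"

# a space is REQUIRED after the digits, the suffix letter is optional
needs_space = re.findall(r"\$\d+ [kmbKMB]?", sample)

# the whole " k" / "M" part is optional, and the letter must end a word
optional_suffix = re.findall(r"\$\d+(?: ?[kmbKMB]\b)?", sample)

print(needs_space)       # ['$500 k']
print(optional_suffix)   # ['$500 k', '$20M', '$3']
```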