# More Strings

Arguably one of the things Python does best is working with strings. It can read large amounts of text and apply operations to every piece of it.

Today we will analyze a list of *puzzle words* that was compiled as part of the Moby lexicon project. This will also be our first example of Python working with files from our computer.

To start, put a copy of these notes and the data file "CROSSWD.TXT" in the same directory. Alternatively, follow the instructions below and use the second cell to fetch the file from its URL on GitHub.

## Jupyter

On Jupyter this is as simple as uploading the .TXT file to the directory you are working in before you open the notebook.


In [None]:
words_file = open('CROSSWD.TXT')
# open creates a file object in Python for us to manipulate


## Google Colab

In Colab you need a module either to read the file from a URL or to read it from your Google Drive account. I like reading it from a URL because this means anyone with the .ipynb file can run the code and get the file. This method will work in Jupyter as well. Just choose the option you want, run that cell, and comment out the other one.

In [2]:
## Run this cell if you are using Google Colab; comment it out if you opened the local file in the cell above

from urllib.request import urlopen
words_file = urlopen('https://github.com/virgilpierce/CS_120/raw/main/CROSSWD.TXT')
# GitHub is public facing, so the link is just the URL I get by right clicking on the "Download" button for the file in GitHub and choosing copy url.

# urllib is part of Python's standard library, so there is nothing to install with pip.
# If you get an error in Jupyter, check your internet connection or access the file using the cell above.

# Note that urlopen returns a stream: the file is not loaded all at once but is read from the network connection as we ask for lines.
# With a slow internet connection this makes this method slower than the open() above.


In [None]:
type(words_file)
# The type indicates that it is an Input/Output stream

http.client.HTTPResponse

In [None]:
[x for x in dir(words_file) if '_' !=x[0]]
# Let's check what methods we have

['begin',
 'chunk_left',
 'chunked',
 'close',
 'closed',
 'code',
 'debuglevel',
 'detach',
 'fileno',
 'flush',
 'fp',
 'getcode',
 'getheader',
 'getheaders',
 'geturl',
 'headers',
 'info',
 'isatty',
 'isclosed',
 'length',
 'msg',
 'peek',
 'read',
 'read1',
 'readable',
 'readinto',
 'readinto1',
 'readline',
 'readlines',
 'reason',
 'seek',
 'seekable',
 'status',
 'tell',
 'truncate',
 'url',
 'version',
 'will_close',
 'writable',
 'write',
 'writelines']

In [None]:
help(words_file.readline)
# we can get information about a method

Help on method readline in module http.client:

readline(limit=-1) method of http.client.HTTPResponse instance
    Read and return a line from the stream.
    
    If size is specified, at most size bytes will be read.
    
    The line terminator is always b'\n' for binary files; for text
    files, the newlines argument to open can be used to select the line
    terminator(s) recognized.



In [None]:
words_file.readline().decode('utf-8')
# each time we execute .readline() it reads the next line in the file as a string. Try it.

'aas\r\n'

In [3]:
words_file = urlopen('https://github.com/virgilpierce/CS_120/raw/main/CROSSWD.TXT')
k=0
while words_file.readline().decode('utf-8')[0] !='b':
  k+=1
words_file.readline().decode('utf-8').strip()


'baa'

In [4]:
k

6557
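
To see what the loop above is counting, here is a small sketch that swaps the URL for an in-memory stand-in (an `io.BytesIO` holding four made-up lines; the name `fake_file` is only for this illustration). Each test of the while condition reads one line, so `k` ends up as the number of lines read before the first one starting with 'b', and the line that stops the loop has already been used up.

```python
import io

# Stand-in for the stream returned by urlopen(...): four lines of bytes
fake_file = io.BytesIO(b'aa\r\naah\r\nba\r\nbaa\r\n')

k = 0
# Every evaluation of the condition consumes one line from the stream
while fake_file.readline().decode('utf-8')[0] != 'b':
    k += 1

# 'ba' was consumed by the test that ended the loop, so the next line is 'baa'
next_word = fake_file.readline().decode('utf-8').strip()
print(k, next_word)
```

This suggests why the cell above shows 'baa' rather than the very first word starting with 'b': that word was read by the test that ended the loop.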

### Byte-Strings

If you are using the URL method above for Google Colab and leave off the .decode('utf-8'), the string you get back has a *b* in front of it. This is how Python designates a type called a byte-string: a sequence of raw bytes rather than characters. Data sent over the internet always arrives as raw bytes, and an encoding is the rule for turning those bytes into characters, including characters beyond the standard alphabet we are using.

We know this file is made up entirely of ordinary text, so we want a regular string without the *b*. We can get that by adding a .decode('utf-8') after the .readline().

'utf-8' specifies the encoding that the byte-string is using (in this case GitHub uses *Unicode Transformation Format, 8-bit*).

We don't really need the '\r\n' line-ending characters, and we can use the .strip() method to remove them:

In [None]:
words_file.readline().decode('utf-8').strip()
# Note that we can just string together methods - and you can start to see the reason they are written as .method()

'aasvogels'
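
The decode-then-strip steps can also be tried without a network connection. The sketch below uses a byte-string literal as a stand-in for one line returned by .readline() (the variable names are only for illustration):

```python
# A byte-string literal standing in for one line returned by .readline()
raw_line = b'aas\r\n'

text_line = raw_line.decode('utf-8')   # bytes -> str
word = text_line.strip()               # drop the trailing '\r\n'

# Show the type before and after decoding, then the effect of strip()
print(type(raw_line).__name__, type(text_line).__name__)
print(repr(text_line), repr(word))
```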

Even better, the file object is an iterable, meaning we can use it in a for loop. Note that if you execute the command that follows, you will probably have to use Interrupt to stop it unless you want to wait a long time.

In [None]:
for line in words_file:
    word = line.decode('utf-8').strip()
    print(word)

Streaming output truncated to the last 5000 lines.
hatchways
hate
hateable
hated
hateful
hatefullness
hatefullnesses
hater
haters
hates
hatful
hatfuls
hath
hating
hatless
hatlike
hatmaker
hatmakers
hatpin
hatpins
hatrack
hatracks
hatred
hatreds
hats
hatsful
hatted
hatter
hatteria
hatterias
hatters
hatting
hauberk
hauberks
haugh
haughs
haughtier
haughtiest
haughtily
haughtiness
haughtinesses
haughty
haul
haulage
haulages
hauled
hauler
haulers
haulier
hauliers
hauling
haulm
haulmier
haulmiest
haulms
haulmy
hauls
haulyard
haulyards
haunch
haunched
haunches
haunt
haunted
haunter
haunters
haunting
hauntingly
haunts
hausen
hausens
hausfrau
hausfrauen
hausfraus
hautbois
hautboy
hautboys
hauteur
hauteurs
havdalah
havdalahs
have
havelock
havelocks
haven
havened
havening
havens
haver
havered
haverel
haverels
havering
havers
haves
having
havior
haviors
haviour
haviours
havoc
havocked
havocker
havockers
havocking
havocs
haw
hawed
hawfinch
hawfinches
hawing
hawk
hawkbill
hawkbills
haw

KeyboardInterrupt: ignored
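
Since looping over the stream uses it up, one option is to read all the words into a list once and then work with the list. This is a sketch, not the approach these notes take: `load_words` is a hypothetical helper name, and `io.BytesIO` stands in for the object returned by urlopen so that the example runs without a network connection.

```python
import io

def load_words(stream):
    """Return a list of decoded, stripped words from an iterable of byte lines."""
    return [line.decode('utf-8').strip() for line in stream]

# io.BytesIO stands in for the object returned by urlopen(...)
fake_file = io.BytesIO(b'aa\r\naah\r\naahed\r\n')
words = load_words(fake_file)
print(words)
```

With the real file you would pass the result of urlopen (or a local file opened in binary mode with `open('CROSSWD.TXT', 'rb')`) to the same function.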

## Program 1

Write a program that reads CROSSWD.TXT and prints only the words with more than 20 characters.

Note that in each of the programs below we need to start by opening the file (or URL). It used to be very important to close the file when you are done. It is now less important **UNLESS** you are writing data to the file: in that case you need to close it before your operating system will ensure that the data sent to the file is actually stored on your system's disk. We will play with some file manipulation later in the semester.

In [None]:
words_file = urlopen('https://github.com/virgilpierce/CS_120/raw/main/CROSSWD.TXT')
for line in words_file:
    word = line.decode('utf-8').strip()
    if len(word)>20:
        print(word)


counterdemonstrations
hyperaggressivenesses
microminiaturizations
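
On the point about closing files: a common way to make sure a file gets closed is Python's `with` statement, which closes the file when the indented block ends. The sketch below writes a small stand-in file (the name SAMPLE.TXT and its three words are made up for this example) and then reads it back with the same length test as Program 1.

```python
import os
import tempfile

# Create a small stand-in file so the example runs anywhere
path = os.path.join(tempfile.mkdtemp(), 'SAMPLE.TXT')
with open(path, 'w') as out_file:
    out_file.write('cat\ncounterdemonstrations\ndog\n')
# out_file is closed here, so the written data has been flushed

with open(path) as in_file:
    # keep only the words with more than 20 characters
    long_words = [line.strip() for line in in_file if len(line.strip()) > 20]

print(long_words)
print(in_file.closed)   # the with block closed the file for us
```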


## Program 2

Write a function called *has_no_e* that takes a word and returns True if it has no e and False if it has an e.  

Then modify your Program 1 to print all the words that have no e.

In [None]:
def has_no_e(word):
  # 'e' not in word is already True or False, so we can return it directly
  return 'e' not in word

In [None]:
words_file = urlopen('https://github.com/virgilpierce/CS_120/raw/main/CROSSWD.TXT')
for line in words_file:
    word = line.decode('utf-8').strip()
    if has_no_e(word):
        print(word)

Streaming output truncated to the last 5000 lines.
clastics
clasts
claucht
claught
claughting
claughts
clausal
claustrophobia
claustrophobias
clavichord
clavichords
claw
clawing
claws
claxon
claxons
clay
claybank
claybanks
claying
clayish
claypan
claypans
clays
click
clicking
clicks
cliff
cliffs
cliffy
clift
clifts
climactic
climatal
climatic
climax
climaxing
climb
climbing
climbs
clinal
clinally
clinch
clinching
cling
clinging
clings
clingy
clinic
clinical
clinically
clinician
clinicians
clinics
clink
clinking
clinks
clip
clipboard
clipboards
clipping
clippings
clips
clipt
cliquing
cliquish
cliquy
clitoral
clitoric
clitoris
cloaca
cloacal
cloak
cloaking
cloaks
clock
clocking
clocks
clockwork
clod
cloddish
cloddy
clodpoll
clodpolls
clods
clog
clogging
cloggy
clogs
clomb
clomp
clomping
clomps
clon
clonal
clonally
clonic
cloning
clonism
clonisms
clonk
clonking
clonks
clons
clonus
cloot
cloots
clop
clopping
clops
closing
closings
closuring
clot
cloth
clothing
clothings
cloth

KeyboardInterrupt: ignored

## Program 3

Write a function named *uses_only* that takes a word and a string of letters and returns True only if the word uses only letters from that string.

Then modify Program 1 to find words from which you could construct a sentence that uses only the letters 'asdfgjkl', if possible.

In [None]:
def uses_only(word, string):
  '''
  For each character in word, check whether the character is in string.
  If it is not in string, return False.
  If we get through without returning False, then return True.
  '''
  for c in word:
    if c not in string:
      return False
  # The for loop has checked each character of word against string.
  # The only way we get through the entire loop is if every c is in string.
  return True
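
The same check can be written more compactly with the built-in `all()`. This is an alternative sketch, not a replacement for the loop above; the name `uses_only_all` is made up here to keep the two versions apart.

```python
def uses_only_all(word, letters):
    """Return True if every character of word appears in letters."""
    # all() stops at the first character that is not in letters
    return all(c in letters for c in word)

print(uses_only_all('salad', 'asdfgjkl'))
print(uses_only_all('hello', 'asdfgjkl'))
```

Note that an empty word returns True in both versions, since there is no character that fails the test.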

In [None]:
words_file = urlopen('https://github.com/virgilpierce/CS_120/raw/main/CROSSWD.TXT')
for line in words_file:
    word = line.decode('utf-8').strip()
    if uses_only(word, 'asdfgjkl'):
        print(word)

aa
aal
aals
aas
ad
add
adds
ads
aff
aga
agas
ala
alas
alaska
alaskas
alfa
alfalfa
alfalfas
alfas
alga
algal
algas
all
alls
as
ask
asks
ass
da
dad
dada
dadas
dads
daff
daffs
dag
dags
dak
daks
fa
fad
fads
fag
fags
fall
fallal
fallals
falls
fas
flag
flags
flak
flask
flasks
gad
gads
gaff
gaffs
gag
gaga
gags
gal
gala
galas
gall
galls
gals
gas
glad
glads
glass
jag
jagg
jaggs
jags
ka
kaas
kaka
kakas
kas
la
lad
lads
lag
lags
lall
lalls
las
lass
sad
sag
saga
sagas
sags
sal
salad
salads
sall
sals
sass
skag
skags
skald
skalds
slag
slags


## Program 4 

Write a function named *uses_all* that takes a word and a string of letters and returns True if the word uses all of the letters from the string at least once. The word may also use other letters.

How many words are there that use all of the vowels 'aeiou'?  How about 'aeiouy'?

In [None]:
def uses_all(word, string):
  # word uses all the letters of string exactly when string uses only letters found in word
  return uses_only(string, word)

In [None]:
words_file = urlopen('https://github.com/virgilpierce/CS_120/raw/main/CROSSWD.TXT')
for line in words_file:
    word = line.decode('utf-8').strip()
    if uses_all(word, 'aeiouy'):
        print(word)

abstemiously
adventitiously
aeronautically
ambidextrously
antievolutionary
antirevolutionary
antiunemployment
authoritatively
autotypies
buoyancies
counterinflationary
evolutionary
extracommunity
facetiously
genitourinary
gregariously
hyperanxious
hypercautious
hyperfastidious
inconsequentially
instantaneously
intravenously
mendaciously
miscellaneously
nefariously
neurologically
neurotically
ostentatiously
outwearying
postrevolutionary
precariously
precautionary
prerevolutionary
revolutionary
sacrilegiously
simultaneously
tenaciously
uncomplimentary
unconventionally
unequivocally
unintentionally
unquestionably
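
To answer "how many" rather than print every word, we can count the matches instead. The sketch below is self-contained: it defines its own version of `uses_all` and runs on a short made-up list (`sample_words` is an assumption for illustration, not the contents of CROSSWD.TXT), so the counts it prints say nothing about the real file.

```python
def uses_all(word, required):
    """Return True if every letter in required appears somewhere in word."""
    return all(c in word for c in required)

# A made-up sample; with the real data you would loop over the file instead
sample_words = ['facetiously', 'education', 'sequoia', 'rhythm', 'unquestionably']

count_aeiou = sum(1 for w in sample_words if uses_all(w, 'aeiou'))
count_aeiouy = sum(1 for w in sample_words if uses_all(w, 'aeiouy'))
print(count_aeiou, count_aeiouy)
```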


## Program 5

Write a function called *is_alphabetical* that returns True if the letters in a word appear in alphabetical order.

In [None]:
'a' < 'b', 'b' < 'a'


(True, False)

In [None]:
def is_alphabetical(word):
  # we need to loop through the characters in the word using their index (position)
  # note: < is strict, so a word with a doubled letter such as 'all' returns False
  n = len(word)
  for j in range(n-1):
    if not (word[j]<word[j+1]):
      return False
  return True
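
The function above compares each letter with the next one using a strict `<`, so doubled letters count as out of order. The sketch below shows the same idea written with `zip`, next to a looser variant that uses `<=` and so accepts doubled letters. Both function names are made up for this comparison; which behavior you want depends on how you read "alphabetical order".

```python
def is_alphabetical_strict(word):
    """Each letter must come strictly before the next one."""
    # zip(word, word[1:]) pairs every letter with its successor
    return all(a < b for a, b in zip(word, word[1:]))

def is_alphabetical_loose(word):
    """Each letter must come before or equal the next one."""
    return all(a <= b for a, b in zip(word, word[1:]))

print(is_alphabetical_strict('abhors'), is_alphabetical_strict('abbey'))
print(is_alphabetical_loose('abbey'), is_alphabetical_loose('baa'))
```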

In [None]:
words_file = urlopen('https://github.com/virgilpierce/CS_120/raw/main/CROSSWD.TXT')
for line in words_file:
    word = line.decode('utf-8').strip()
    if is_alphabetical(word):
        print(word)

abet
abhor
abhors
ably
abo
abort
abos
aby
ace
acers
aces
achy
act
ad
adept
adios
adit
ado
adopt
ados
ads
adz
ae
aegis
aery
aft
agin
agio
agios
agist
aglow
agly
ago
ah
ahoy
ai
ail
ails
aim
aims
ain
ains
air
airs
airt
airy
ais
ait
almost
alms
alow
alp
alps
alt
am
amort
amp
amps
amu
an
ant
any
apt
ar
ars
art
arty
as
at
aw
ay
be
befit
beg
begin
begins
begirt
begot
begs
beknot
bel
below
bels
belt
ben
bens
bent
best
bet
bevy
bey
bi
bijou
bijoux
bin
bins
bint
bio
biopsy
bios
bis
bit
blot
blow
blowy
bo
bop
bops
bort
borty
bortz
bos
bot
bow
box
boxy
boy
buy
by
ceil
ceils
celt
cent
chi
chimp
chimps
chin
chino
chinos
chins
chintz
chip
chips
chis
chit
chivy
chop
chops
chow
cist
city
clop
clops
clot
cloy
cop
cops
copy
cos
cost
cosy
cot
cow
cowy
cox
coy
coz
crux
cry
de
defi
defis
deft
defy
dehort
dei
deil
deils
deist
deity
del
dels
demo
demos
demy
den
dens
dent
deny
des
dev
dew
dewy
dex
dey
dhow
dim
dims
din
dins
dint
dip
dips
dipt
dirt
dirty
dit
do
dopy
dor
dors
dorty
dory
dos
dost
dot
doty
dow
dox