# Last Time
* We looked at the basics of loading a text file and reading its "words" into a list.
* We worked through some exercises related to lists, sets, and dictionaries.
* We took a look at strings and formatting.

# This Time
* We'll wrap up our discussion of major string operations.
* We'll talk about data distributions in Python.

# Searching for Substrings
* Can search a string for a **substring** to accomplish many tasks:
    * determine whether a string contains a substring
    * count number of occurrences
    * determine the index at which a substring resides in a string
* Each method we look at matches characters by their underlying numeric values, so searches are case-sensitive (capitalization counts!)

### Counting Occurrences
* **`count`** returns the number of times its argument occurs in a string
* If you specify as the second argument a **start index**, `count` searches only the slice **_string_`[`*start_index*`:]`**
* If you specify as the second and third arguments the **start index** and **end index**, `count` searches only the slice **_string_`[`*start_index*`:`*end_index*`]`**

In [None]:
sentence = 'that that creature was there after moonfall was that which perplexed me'
print(sentence.count('that'))

In [None]:
print(sentence.count('that',6))

In [None]:
print(sentence.count('that',10,30))
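
As a quick check of the slice behavior described above, the sketch below compares `count` with start/end arguments against counting in the equivalent slice. It also shows that matching is case-sensitive. The lowercase conversion at the end is just one possible workaround, not the only one.

```python
sentence = 'that that creature was there after moonfall was that which perplexed me'

# count with a start index behaves like counting in the slice sentence[6:]
from_six = sentence.count('that', 6)
print(from_six == sentence[6:].count('that'))

# count with start and end indices behaves like counting in sentence[10:30]
in_range = sentence.count('that', 10, 30)
print(in_range == sentence[10:30].count('that'))

# Matching is case-sensitive, so 'That' is not found
capitalized = sentence.count('That')
print(capitalized)

# One way to ignore case: lowercase the string before counting
ignoring_case = sentence.lower().count('that')
print(ignoring_case)
```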

### Locating a Substring in a String
* **`index`** searches for a substring within a string and returns the first index at which the substring is found
* If it is **not** found, a `ValueError` occurs.
* Like `count`, may have **start index** and **end index** arguments 

In [None]:
print(sentence) 
print(sentence.index('was'))

In [None]:
print(sentence.index('was',20))

In [None]:
print(sentence.index('was',20,40)) #will raise a ValueError, since the substring is not present within this range
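
Because `index` raises a `ValueError` when the substring is missing, a program that should keep running needs to handle that exception. A minimal sketch follows. Using `None` to signal "not found" is only an assumption made for illustration.

```python
sentence = 'that that creature was there after moonfall was that which perplexed me'

try:
    position = sentence.index('was', 20, 40)
except ValueError:
    # No complete occurrence of 'was' lies within sentence[20:40]
    position = None

print(position)
```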

### Index Variants ###
* **`rindex`** performs the same operation as `index`, but searches from the end of the string backwards.
* **`find`** performs the same task as `index` but returns `-1` if the substring isn't found.
* **`rfind`** performs the same operation as `find`, but searches from the end of the string backwards.

In [None]:
print(sentence)
print(sentence.index('that')) #first time substring is encountered
print(sentence.rindex('that')) #last time substring is encountered


In [None]:
print(sentence)
print(sentence.find('that',20,40))
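
The sketch below illustrates one thing to watch for with `find`: its result should be compared against `-1` explicitly. A match at index `0` is falsy, so a bare `if sentence.find(...)` test can be misleading.

```python
sentence = 'that that creature was there after moonfall was that which perplexed me'

first = sentence.find('that')     # index of the first occurrence
last = sentence.rfind('that')     # index of the last occurrence
missing = sentence.find('koala')  # -1, since 'koala' does not occur

print(first, last, missing)

# A match at index 0 is falsy, so compare against -1 instead of testing truthiness
if first != -1:
    print('found at index', first)
if missing == -1:
    print('koala was not found')
```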

### Determining if a String Contains a Substring Or Begins/Ends With One 
* To check whether a string contains a substring, use operator `in`
* The _method_ **`startswith`** returns `True` if the string starts with the specified substring
* The _method_ **`endswith`** returns `True` if the string ends with the specified substring


In [None]:
print(sentence)
print('that' in sentence)

In [None]:
print('koala' not in sentence)

In [None]:
print(sentence.startswith('that'))

In [None]:
print(sentence.endswith('that'))
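
A few more checks along the same lines are sketched below. `startswith` and `endswith` also accept a tuple of candidate substrings, which is handy when several prefixes or suffixes are acceptable. The candidate strings used here are arbitrary examples.

```python
sentence = 'that that creature was there after moonfall was that which perplexed me'

# A tuple argument returns True if the string starts with ANY of the candidates
starts_with_either = sentence.startswith(('this', 'that'))
print(starts_with_either)

# endswith works the same way at the other end of the string
ends_with_me = sentence.endswith('me')
print(ends_with_me)

# All of these checks are case-sensitive
starts_with_capital = sentence.startswith('That')
print(starts_with_capital)
```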

# Replacing Substrings
* A common text manipulation is to locate a substring and replace its value
* **`replace`** searches a string for the substring in its first argument and replaces _each_ occurrence with the substring in its second argument
* Can receive an optional third argument specifying the maximum number of replacements
* Use an empty string (`''`) for the second argument to "delete" any substring

In [None]:
values = '1\t2\t3\t4\t5' # \t is a tab character
print(values)

In [None]:
print(values.replace('\t', ','))
print(values.replace('\t', ',',2))
x=values.replace('1','10')
print(values)
print(x)
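
The following sketch illustrates the last two bullets above. Passing `''` as the replacement deletes the substring, and a third argument caps the number of replacements. In every case `replace` returns a new string and leaves the original untouched, because strings are immutable.

```python
values = '1\t2\t3\t4\t5'

# Replace every tab with an empty string, effectively deleting the tabs
no_tabs = values.replace('\t', '')
print(no_tabs)

# Replace only the first two tabs; the remaining tabs are kept
first_two = values.replace('\t', ',', 2)
print(repr(first_two))

# The original string is unchanged
print(repr(values))
```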

# Tokens and Tokenization
* The word **token** has several meanings in the context of computer science -- in the case of strings a single token generally represents an individual element of interest (e.g. a word or even a combination of symbols with a particular meaning).
* **Tokenization** is the process of splitting apart individual tokens from a larger collective (e.g. a string or file).
* Tokens typically are separated by whitespace characters such as blank, tab and newline, though other characters may be used—these separators are known as **delimiters**

### Splitting Strings
* Recall that `split` separates (i.e. tokenizes) parts of a string using a delimiter 
    * The default is to follow the approach above (any whitespace characters, including ' ', '\n', '\t') 
    * To tokenize a string at a custom delimiter, specify the delimiter string that `split` uses to tokenize the string
* `split` can have a specified maximum number of splits with an integer as the second argument
    * In such cases, the remaining portion of the string will remain a contiguous substring of the original.
    * `rsplit` can be used in lieu of `split` to process tokens from right to left.

In [None]:
lettersV1 = 'A B\t C\n D '
print(lettersV1.split(None)) #By default, any whitespace or combination of whitespace characters separate one token from the next


In [None]:
lettersV2 = 'A,B,C,D'
print(lettersV2.split()) #The default delimiter here won't work
print(lettersV2.split(',')) #Instead we can specify ',' as a custom separator.


In [None]:
print(lettersV2.split(',',2)) #tokenize for the first two occurrences of ',' -- leave the rest intact
print(lettersV2.rsplit(',',2)) #tokenize for the last two occurrences of ',' -- leave the rest intact
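
One subtlety is worth a short sketch: the default whitespace splitting is not the same as passing `' '` explicitly. With an explicit delimiter, every single occurrence separates tokens, so consecutive or trailing delimiters produce empty strings.

```python
lettersV1 = 'A B\t C\n D '

# Default: runs of any whitespace separate tokens; leading/trailing whitespace is ignored
default_tokens = lettersV1.split()
print(default_tokens)

# Explicit ' ': only single spaces are delimiters, so tabs and newlines stay in the tokens
space_tokens = lettersV1.split(' ')
print(space_tokens)

# Consecutive custom delimiters yield empty tokens
comma_tokens = 'A,,B'.split(',')
print(comma_tokens)
```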

### Joining Strings
* **`join`** concatenates the strings in its argument, which must be an iterable containing only string values or expressions
* The separator between the concatenated items is the _string_ on which you call `join`
    * Use an empty string to simply join the iterable elements as they are
* List comprehensions can be joined in this way with ease.

In [None]:
letters_list = ['A', 'B', 'C', 'D']

In [None]:
print('||'.join(letters_list)) #join with pipes separating elements of the original iterable
print(''.join(letters_list)) #join with nothing separating elements of the original iterable

In [None]:
print(','.join([str(i*i) for i in range(10)])) #Join the results of a list comprehension that creates a list of strings
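
Since `join` accepts only strings, joining numbers directly fails with a `TypeError`. The sketch below shows the failure and one common fix, which converts each item with `str` first. The list `numbers` is made up for illustration.

```python
numbers = [1, 2, 3]

try:
    ','.join(numbers)  # fails: the items are ints, not strings
except TypeError as error:
    print('join failed:', error)

# Convert each item to a string before joining
joined = ','.join(str(n) for n in numbers)
print(joined)

# split and join with the same delimiter undo each other
round_trip = ','.join('A,B,C,D'.split(','))
print(round_trip)
```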

### String Methods `partition` and `rpartition` 
* String method **`partition`** splits a string into a tuple of three strings based on the method’s _separator_ argument
    * the part of the original string before the separator
    * the separator itself
    * the part of the string after the separator
* To search for the separator from the end of the string, use method **`rpartition`** 

In [None]:
name, separator, grades = 'Amanda: 89, 97, 92'.partition(': ')
print(name)
print(separator)
print(grades)

In [None]:
url = 'http://www.deitel.com/books/PyCDS/table_of_contents.html'
rest_of_url, separator, document = url.rpartition('/')
print(rest_of_url)
print(separator)
print(document)
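
It can help to see what happens when the separator is absent. `partition` and `rpartition` always return a tuple of three strings, so unpacking never fails. They differ in where the original string ends up. The input `'Amanda'` below is just an example with no separator.

```python
# Separator missing: partition puts the whole string FIRST
missing_left = 'Amanda'.partition(': ')
print(missing_left)

# Separator missing: rpartition puts the whole string LAST
missing_right = 'Amanda'.rpartition(': ')
print(missing_right)

# With several separators, partition splits at the first and rpartition at the last
first_split = 'a/b/c'.partition('/')
last_split = 'a/b/c'.rpartition('/')
print(first_split, last_split)
```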

### String Method `splitlines` 
* **`splitlines`** returns a list of new strings representing lines of text split at each newline character in the original string
* Passing `True` to `splitlines` keeps the newlines

In [None]:
lines = """This is line 1
This is line 2
This is line 3"""

In [None]:
print(lines)
print(lines.splitlines())
print(lines.splitlines(True))
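
A short comparison, sketched below, shows why `splitlines` is often preferable to `split('\n')` for text read from files. It does not produce a trailing empty string, and it also recognizes Windows-style `\r\n` line endings. The sample strings are made up for illustration.

```python
text = 'first\nsecond\n'

# splitlines ignores the final newline
with_splitlines = text.splitlines()
print(with_splitlines)

# split('\n') leaves an empty string after the final newline
with_split = text.split('\n')
print(with_split)

# Windows-style line endings are handled as single line breaks
windows_lines = 'first\r\nsecond'.splitlines()
print(windows_lines)

# Passing True keeps whatever line ending was present
kept_endings = 'first\r\nsecond'.splitlines(True)
print(kept_endings)
```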

# Characters and Character-Testing Methods
* In Python, a character is simply a one-character string
* Python provides string methods for testing whether a string matches certain characteristics
    * **`isdigit`** returns `True` if the string on which you call the method contains only the digit characters (`0`–`9`)
    * **`isalnum`** returns `True` if the string on which you call the method is alphanumeric (only digits and letters)
* Character-testing is critical for validating input or searching for specific data-types.
   

In [None]:
print('27'.isdigit())
print('-27'.isdigit()) #EVERY character in the string must be a digit for this to return True

In [None]:
print('A9876'.isalnum())
print('123 Main Street'.isalnum()) # Whitespace characters are NOT alphanumeric

* Table of many character-testing methods

| String Method | Description
| -------- | --------
| `isalnum()` | Returns `True` if the string contains only _alphanumeric_ characters (i.e., digits and letters).
| `isalpha()`  | Returns `True` if the string contains only _alphabetic_ characters (i.e., letters).
| `isdecimal()`  | Returns `True` if the string contains only _decimal integer_ characters (that is, base 10 integers) and does not contain a + or - sign.
| `isdigit()`  | Returns `True` if the string contains only digits (e.g., '0', '1', '2').
| `isidentifier()`  | Returns `True` if the string represents a valid _identifier_. 
| `islower()`  | Returns `True` if all alphabetic characters in the string are _lowercase_ characters (e.g., `'a'`, `'b'`, `'c'`).
| `isnumeric()`  | Returns `True` if the characters in the string represent a _numeric value_ without a + or - sign and without a decimal point.
| `isspace()`  | Returns `True` if the string contains only _whitespace_ characters.
| `istitle()`  | Returns `True` if the first character of each word in the string is the only _uppercase_ character in the word. 
| `isupper()`  | Returns `True` if all alphabetic characters in the string are _uppercase_ characters (e.g., `'A'`, `'B'`, `'C'`).
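
The three numeric tests in the table are easy to confuse, so the sketch below runs them side by side on a few inputs. The function `is_signed_integer` is a hypothetical helper written for this example (it is not a built-in). It shows one possible way to validate input that may carry a sign.

```python
samples = {
    'plain digits': '27',
    'signed number': '-27',
    'decimal point': '3.14',
    'superscript two': '\u00b2',    # the character for "squared"
    'fraction one half': '\u00bd',  # the single character for 1/2
}

for label, text in samples.items():
    print(f'{label:18} isdecimal={text.isdecimal()} '
          f'isdigit={text.isdigit()} isnumeric={text.isnumeric()}')


def is_signed_integer(text):
    """Hypothetical helper: optional leading + or -, followed only by decimal digits."""
    unsigned = text[1:] if text[:1] in ('+', '-') else text
    return unsigned.isdecimal()


print(is_signed_integer('-27'), is_signed_integer('3.14'), is_signed_integer('-'))
```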

# Raw Strings
* Backslash characters in strings introduce _escape sequences_ — like `\n` for newline and `\t` for tab
* To include a backslash in a string, use two backslash characters `\\`
* Doubled backslashes can make some strings, such as file paths, difficult to read or interpret
* Consider a Microsoft Windows file location: _C:\MyFolder\MySubFolder\MyFile.txt_. As an ordinary string literal, it must be written as `'C:\\MyFolder\\MySubFolder\\MyFile.txt'`
* **Raw strings**, which are preceded by the character `r`, are often more convenient
* They treat each backslash as a regular character, rather than the beginning of an escape sequence

In [None]:
file_path = 'C:\\MyFolder\\MySubFolder\\MyFile.txt'
print(file_path)

In [None]:
file_path = r'C:\MyFolder\MySubFolder\MyFile.txt'
print(file_path)
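
To see why this matters, the sketch below uses a made-up path whose folder names happen to begin with `n` and `t`. Without doubled backslashes or a raw string, `\n` and `\t` are silently interpreted as a newline and a tab.

```python
escaped = 'C:\\new\\table.txt'  # backslashes doubled by hand
raw = r'C:\new\table.txt'       # raw string: backslashes taken literally
broken = 'C:\new\table.txt'     # \n and \t become a newline and a tab

# The escaped and raw forms describe the same string
print(escaped == raw)

# The broken form is two characters shorter and contains control characters
print(len(raw) - len(broken))
print(repr(broken))
```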

# Data Distributions
* A **data distribution** is simply a collection of data *projected* through the lens of a subset of variables (usually one). 
* A common data analysis/science task is associating a given data distribution with a *probability distribution*.
* A **probability distribution** describes how likely a variable is to possess values within the allowable range. 
     * Correctly associating data with probability distributions can help us understand underlying data characteristics and enhance our ability in future prediction.
    
* We will briefly address tasks of generating data according to distributions and fitting data to distributions in Python.
     * Since this is *not* a course on statistics, we will restrict our discussion to simple distributions, and address the topic in limited detail.

### A Simple Start: The Uniform Distribution
* A **uniform distribution** is one wherein all values (or outcomes) are equally likely to appear (or occur).
     * Uniform distributions may be *discrete*, in which case a finite number of values can appear.
     * They can likewise be *continuous*, in which case any arbitrary value in a set range is equally likely to appear.
* It is possible to use the **numpy.random.uniform** function to generate uniformly distributed data.
     * Instead, however, we will rely on the familiar task of *rolling dice* to produce and analyze uniform distributions. 

In [None]:
import numpy as np
# We can generate 1000 and 1 million die rolls from scratch using numpy's randint function
dierollarrays = []
dierollarrays.append(np.random.randint(1,7,size=1_000)) #1000 die rolls
dierollarrays.append(np.random.randint(1,7,size=1_000_000)) #1 million die rolls

In [None]:
import matplotlib.pyplot as plt
#Let's next generate a familiar set of plots for our die rolls 
for drarray,spi,disttitle in zip(dierollarrays,[1,2], ['1_000','1_000_000']):
    plt.subplot(1,2,spi)
    rollcounts = np.bincount(drarray)[1:] #get all counts from 0 to 6, and ignore "bin" for 0
    plt.bar(range(1,7),rollcounts)    
    plt.title(str(disttitle).replace('_',',') + ' Die Rolls')
fig = plt.gcf()
fig.set_size_inches(18.5, 5)
plt.show()
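
The plots suggest that the bars even out as the number of rolls grows. A numeric version of the same check is sketched below. It uses NumPy's `default_rng` generator with an arbitrary seed, which is an assumption made here so the results are repeatable; the cells above use `np.random.randint` instead.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen only for repeatability

max_deviation = {}
for n in (1_000, 1_000_000):
    rolls = rng.integers(1, 7, size=n)            # values 1..6 (upper bound is exclusive)
    counts = np.bincount(rolls, minlength=7)[1:]  # drop the unused bin for 0
    frequencies = counts / n
    # Largest gap between an observed frequency and the ideal 1/6
    max_deviation[n] = np.abs(frequencies - 1 / 6).max()
    print(f'{n:>9,} rolls: largest deviation from 1/6 is {max_deviation[n]:.5f}')
```

With more rolls, each face's relative frequency should sit closer to 1/6, which is what a discrete uniform distribution predicts.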

### Normal Distribution
* A **normal distribution** is one wherein values are densely packed around a symmetric mean, with the likelihood of given values appearing falling off as the distance from the mean increases.
* The normal distribution is known for its excellent ability to *approximate* distributions of many natural phenomena (test scores, vital sign measurements, etc.)
* Unlike the other distributions we look at today, the normal distribution relies on two parameters:
     * The first (distribution mean) indicates where data are likely to be clustered
     * The second (distribution standard deviation) indicates how quickly value likelihood falls off.
          * About 68% of the data will fall within one standard deviation of the mean, while about 95% will fall within two standard deviations.
* The easiest way to generate an array of normally distributed data is with the **numpy.random.normal** function.
     * Parameters include distribution mean (*mu*), standard deviation (*sigma*), and number of samples.

In [None]:
import numpy as np
mu, sigma = 0, 0.1 # mu is the mean and sigma is the standard deviation
normarrays=[]
normarrays.append(np.random.normal(mu, sigma, 1_000))
normarrays.append(np.random.normal(mu, sigma, 1_000_000))

In [None]:
import matplotlib.pyplot as plt
for normarray,spi,disttitle in zip(normarrays,[1,2], ['1000 Normal Dist. Samples','1 Million Normal Dist. Samples']):
    plt.subplot(1,2,spi)
    count, bins, _ = plt.hist(normarray, 51, density=True)
    mid = (bins[1:] + bins[:-1]) / 2
    expected = 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (mid - mu)**2 / (2 * sigma**2) )
    plt.bar(mid,count,width = 0.0001)
    plt.plot(mid,expected)
    plt.title(disttitle)
#plt.ylim((0,max(count+1)))
fig = plt.gcf()
fig.set_size_inches(18.5, 5)
plt.show()
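
The rule of thumb about one and two standard deviations can be checked empirically. The sketch below draws samples with the same `mu` and `sigma` as above and measures the fraction that falls within each band. The seeded `default_rng` generator is an assumption made for repeatability.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, chosen only for repeatability
mu, sigma = 0, 0.1
samples = rng.normal(mu, sigma, 1_000_000)

# Fraction of samples within one and two standard deviations of the mean
distance = np.abs(samples - mu)
within_one = np.mean(distance <= sigma)
within_two = np.mean(distance <= 2 * sigma)

print(f'within 1 standard deviation:  {within_one:.3f}')
print(f'within 2 standard deviations: {within_two:.3f}')
```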

### Distribution Fitting
* numpy makes generation of diverse distributions of data simple and efficient.
* However, numpy does not directly address *distribution fitting* 
* The goal of **distribution fitting** is to find the correct distribution type from which a set of data were generated, as well as the underlying parameters for generation.
    * As a general rule, distribution fitting is a far more challenging process than distribution generation.
    * While there are many packages that address the task, we will use *distfit*.

### The distfit Package
* The **distfit** package, authored by *Erdogan Taskesen*, is a maintained, custom library primarily used to fit data to probability distributions.
* distfit may be installed using pip, or in anaconda using `conda install -c conda-forge distfit`
* It is an extremely flexible package whose functionality extends to creating and applying custom distribution functions to distribution fitting tasks.
    * For our purposes, we will restrict our usage to fitting our generated data to standard distribution types using residual sum of squares (RSS) based criteria.
         * Here, RSS is used to gauge the deviation of a distribution from an empirical model.  (The smaller the RSS value is for given data and distribution, the better the fit.)

####          Let's now attempt to fit our normally distributed arrays to distributions.

In [None]:
from time import sleep #Used to prevent fit_transform executions from overlapping
from distfit import distfit #We only need the basic distfit object for distribution fitting
dist = distfit(distr=['uniform','norm','expon','dweibull']) #Only include these four distributions for consideration ('dweibull' is the double Weibull)
for normarray,disttitle in zip(normarrays,['1000 Normal Dist. Samples','1 Million Normal Dist. Samples']):
    print('Fitting ' + disttitle + ' to uniform, normal, exponential, and Weibull distributions:')
    sleep(0.5) #Make sure all info from previous iteration of fit_transform has already been printed
    details=dist.fit_transform(normarray) #fit_transform will print RSS data we need; output includes detailed specifics
    print()
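
To give a feel for what an RSS comparison measures, here is a hand-rolled sketch using only NumPy. It is a simplified illustration and is *not* distfit's actual procedure. It assumes two candidate distributions (normal and uniform), estimates their parameters directly from the sample, and compares each candidate's density with the histogram of the data.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, chosen only for repeatability
samples = rng.normal(0, 0.1, 100_000)

# Empirical density: histogram heights normalized so the total area is 1
density, edges = np.histogram(samples, bins=50, density=True)
mid = (edges[1:] + edges[:-1]) / 2

# Candidate 1: normal distribution with mean and standard deviation taken from the sample
mu_hat, sigma_hat = samples.mean(), samples.std()
normal_pdf = np.exp(-(mid - mu_hat) ** 2 / (2 * sigma_hat ** 2)) / (sigma_hat * np.sqrt(2 * np.pi))

# Candidate 2: uniform distribution spanning the observed range
uniform_pdf = np.full_like(mid, 1 / (samples.max() - samples.min()))

# Residual sum of squares between the empirical density and each candidate
rss_normal = np.sum((density - normal_pdf) ** 2)
rss_uniform = np.sum((density - uniform_pdf) ** 2)

print(f'estimated mu={mu_hat:.4f}, sigma={sigma_hat:.4f}')
print(f'RSS normal={rss_normal:.4f}, RSS uniform={rss_uniform:.4f}')
```

The candidate with the smaller RSS tracks the histogram more closely. For normally generated data, that should be the normal candidate.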
    

### Exponential Distribution
* An **exponential distribution** describes the time intervals between events in a *Poisson point process*, wherein events occur independently at a constant average rate, though individual intervals can vary widely.
     * For example, suppose a busy tax lawyer gets texts at an average rate of one every 10 minutes during tax season.
          * Many texts will arrive with an interval of 10 minutes or less.
          * However, an occasional interval between texts that is *much* longer (1 or 2 hours, etc.) may occur.
* The exponential distribution is mechanically simple, relying solely on an (average) rate parameter.
     * More general distributions appropriate for interval data include the *Gamma* and *Weibull* distributions.
* Arrays of exponentially distributed data can be generated with the **numpy.random.exponential** function.
     * Parameters include the scale (the average interval between events, which is the inverse of the rate) and the number of samples.

In [None]:
exparrays=[]
scale = 10 #average interval between events, i.e. 1/rate (numpy expects the scale, not the rate)
exparrays.append(np.random.exponential(scale, 1_000))
exparrays.append(np.random.exponential(scale, 1_000_000))

In [None]:
import matplotlib.pyplot as plt
for exparray,spi,disttitle in zip(exparrays,[1,2], ['1000 Exp. Dist. Samples','1 Million Exp. Dist. Samples']):
    plt.subplot(1,2,spi)
    count, bins, _ = plt.hist(exparray, 51, density=True)
    mid = (bins[1:] + bins[:-1]) / 2
    plt.bar(mid,count,width = 0.0001)
    plt.title(disttitle)
fig = plt.gcf()
fig.set_size_inches(18.5, 5)
plt.show()
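
The texting example can be checked numerically. The sketch below assumes an average interval of 10 minutes and uses a seeded `default_rng` generator for repeatability (both are choices made for illustration). It reports the average interval, the share of intervals no longer than the average, and the longest interval observed.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed, chosen only for repeatability
mean_interval = 10                   # average minutes between texts (the scale parameter)
intervals = rng.exponential(mean_interval, 1_000_000)

sample_mean = intervals.mean()                        # should be close to 10
share_short = np.mean(intervals <= mean_interval)     # share of intervals of 10 minutes or less
longest = intervals.max()                             # occasional intervals are far longer

print(f'average interval: {sample_mean:.2f} minutes')
print(f'share of intervals <= {mean_interval} minutes: {share_short:.3f}')
print(f'longest interval: {longest:.1f} minutes')
```

For an exponential distribution, the share of intervals at or below the mean is 1 - 1/e, or roughly 63%. This matches the observation that most texts arrive within 10 minutes while a few gaps are much longer.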



####          We can also attempt to fit our exponentially distributed arrays to distributions.

In [None]:
#Let's now attempt to fit our exponentially distributed arrays to distributions
for exparray,disttitle in zip(exparrays,['1000 Exp. Dist. Samples','1 Million Exp. Dist. Samples']):
    print('Fitting ' + disttitle + ' to uniform, normal, exponential, and Weibull distributions:')
    sleep(0.5) #Make sure all info from previous iteration of fit_transform has already been printed
    details=dist.fit_transform(exparray) #fit_transform will print RSS data we need; output includes detailed specifics
    print()