<a href="https://colab.research.google.com/github/niklaust/Data_Science/blob/main/Python_for_Data_Analysis_notebook_of_niklaust.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Reference**
Wes McKinney. (2022). *Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter, Third Edition*. O'Reilly.

github:niklaust

start 20230327

<h1><center><b>Data Analysis</b></center></h1>

# <center><b>Chapter 1. Preliminaries</b></center>

## **1.1 What we will learn about?**

These notes aim to provide adequate preparation for moving on to a more domain-specific resource.

We will learn about:
* manipulating
* processing
* cleaning
* crunching data

to become an effective data analyst.

**What kinds of Data?**

The primary focus is on **structured data**

* **Tabular** or **spreadsheet-like data** in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files.
* **Multidimensional arrays** (matrices).
* Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a **SQL** user).
* Evenly or unevenly spaced **time series**.

## **1.2 Why Python for Data Analysis?**

Python has developed a large and active scientific computing and data analysis community.

Python has become one of the most **important languages** for **data science**, **machine learning**, and general software development in academia and industry.


**Why Not Python?**

There are a number of uses for which Python may be less suitable.

As Python is an interpreted programming language, in general most Python **code will run substantially slower** than code written in a compiled language like Java or C++.

## **1.3 Essential Python Libraries**

### **NumPy**

NumPy, short for Numerical Python, has long been a cornerstone of **numerical computing** in Python. It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python. NumPy contains, among other things:

* A fast and efficient multidimensional array object **ndarray**
* Functions for performing element-wise **computations with arrays** or **mathematical operations between arrays**
* **Tools** for **reading** and **writing** array-based datasets to disk
* **Linear algebra operations**, Fourier transform, and random number generation
* A **mature C API** to enable Python **extensions** and native C or C++ code to access NumPy's data structures and computational facilities

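For numerical data, NumPy arrays are more efficient for storing and manipulating data than the other built-in Python data structures. Many numerical computing tools for Python either assume NumPy arrays as a primary data structure or else target interoperability with NumPy.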

### **pandas**

pandas provides **high-level data structures** and **functions designed** to make **working with structured** or **tabular data** intuitive and flexible. 

The primary objects in pandas that we will focus on are the **DataFrame**, a tabular column-oriented data structure with both row and column labels, and the **Series**, a one-dimensional labeled array object.

pandas blends the array-computing ideas of NumPy with the kinds of data manipulation capabilities found in spreadsheets and relational databases (such as SQL). It provides convenient indexing functionality to enable you to reshape, slice and dice, perform aggregations, and select subsets of data. **Data manipulation**, **preparation**, and **cleaning** are such important skills in data analysis that pandas is a primary focus of these notes.

### **matplotlib**

matplotlib is the most popular Python library for **producing plots** and **other two-dimensional data visualizations**. 

### **IPython**

IPython is a programming tool designed to facilitate **interactive computing** and **software development work**. It encourages an execute-explore workflow rather than the typical edit-compile-run workflow of many other programming languages. IPython also provides access to the operating system's shell and filesystem, which reduces the need to switch between a terminal window and a Python session.

### **SciPy**

SciPy is a collection of packages **addressing a number of foundational problems** in **scientific computing**. 

* `scipy.integrate` : Numerical integration routines and differential equation solvers
* `scipy.linalg` : Linear algebra routines and matrix decompositions extending beyond those provided in `numpy.linalg`
* `scipy.optimize` : Function optimizers (minimizers) and root finding algorithms
* `scipy.signal` : Signal processing tools
* `scipy.sparse` : Sparse matrices and sparse linear system solvers 
* `scipy.special` : Wrapper around SPECFUN, a FORTRAN library implementing many common mathematical functions, such as the `gamma` function
* `scipy.stats` : Standard continuous and discrete probability distributions (density functions, samples, continuous distribution functions), various statistical tests, and more descriptive statistics.

Together, NumPy and SciPy form a reasonably complete and mature computational foundation for many traditional scientific computing applications.

### **scikit-learn**

scikit-learn has become the premier general-purpose **machine learning toolkit for Python programmers**. As of this writing, more than two thousand different individuals have contributed code to the project. It includes submodules for such models as:

* **Classification**: SVM, nearest neighbors, random forest, logistic regression, etc.
* **Regression**: Lasso, ridge regression, etc.
* **Clustering**: k-means, spectral clustering, etc.
* **Dimensionality reduction**: PCA, feature selection, matrix factorization, etc.
* **Model selection**: Grid search, cross-validation, metrics
* **Preprocessing**: Feature extraction, normalization

### **statsmodels**

statsmodels is a **statistical analysis package** that implements a number of **regression analysis models** popular in the R programming language.

Compared with scikit-learn, statsmodels contains algorithms for classical (primarily frequentist) **statistics** and **econometrics**. This includes such submodules as:

* **Regression models**: linear regression, generalized linear models, robust linear models, linear mixed effects models, etc.
* **Analysis of variance** (ANOVA)
* **Time series analysis**: AR, ARMA, ARIMA, VAR, and other models
* **Nonparametric methods**: Kernel density estimation, kernel regression
* **Visualization of statistical model results**

statsmodels is more focused on **statistical inference**, providing uncertainty estimates and p-values for parameters. scikit-learn, by contrast, is more prediction focused. 


Whatever the end goals of the work, the tasks required generally fall into a number of broad groups:

* **Interacting with the outside world** - Reading and writing with a variety of file formats and data stores
* **Preparation** - Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis
* **Transformation** - Applying mathematical and statistical operations to groups of datasets to derive new datasets (e.g., aggregating a large table by group variables)
* **Modeling and computation** - Connecting your data to statistical models, machine learning algorithms, or other computational tools
* **Presentation** - Creating interactive or static graphical visualizations or textual summaries

# <center><b>Chapter 2. Python Language Basics, IPython and Jupyter Notebooks </b></center>

These notes are an introductory text on working with data in Python.

They focus mostly on:
* Table-based analytics
* Data preparation tools for working with data sets

Sometimes we need to do some wrangling to arrange messy data into a more nicely tabular (or structured) form.

## **2.1 The Python Interpreter**

Python is an **interpreted language**. The Python interpreter **runs a program by executing one statement at a time**.

Many people who do data analysis or scientific computing use **IPython**, an enhanced Python interpreter, or **Jupyter notebooks**, web-based code notebooks originally created within the IPython project.

## **2.2 IPython Basics**

IPython provides facilities to execute arbitrary blocks of code (via a somewhat glorified copy-and-paste approach) and whole Python scripts. 

In [None]:
a = 5
a

5

In [None]:
import numpy as np

data = [np.random.standard_normal() for i in range(7)]
data

[-0.0897850222318582,
 1.7311433792888753,
 1.4131745614923839,
 -0.151770027838262,
 1.3853420237700667,
 -0.3713195143206432,
 0.35114147658826644]

**Jupyter Notebook**



One of the major components of the Jupyter project is the **notebook**, a type of **interactive document for code**, text (including Markdown), data visualizations, and other output.

**Introspection**

Using a question mark (?) before or after a variable will display some general information about the object:

In [None]:
b = [1, 2, 3]

In [None]:
b?

In [None]:
print?

In [None]:
def add_numbers(a, b):
  """
  Add two numbers together
  
  Returns 
  --------
  the_sum : type of arguments
  """
  return a + b

In [None]:
add_numbers?

`?` has a final usage, which is for searching the IPython namespace in a manner similar to the standard Unix or Windows command line. A number of characters combined with the wildcard (\*) will show all names matching the wildcard expression. For example, we could get a list of all functions in the top-level NumPy namespace containing `load`:

In [None]:
import numpy as np 

In [None]:
np.*load*?

## **2.3 Python Language Basics**

**Language Semantics**

The Python language design is distinguished by its **emphasis on readability**, **simplicity**, and **explicitness**. Some people go so far as to liken it to "executable pseudocode."

Python uses **whitespace** (tabs or spaces) to structure code instead of using braces as in many other languages.

A **colon** denotes the **start of an indented code block**, after which all of the code must be indented by the same amount until the end of the block.

**Semicolons** can be used to **separate multiple statements** on a single line. However, putting multiple statements on one line is generally discouraged in Python because it can make code less readable.


**Comments**: any text preceded by the hash mark (pound sign) `#` is ignored by the Python interpreter. This is often used to add comments to code.
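
As a small illustration of these rules, the sketch below nests an `if`/`else` inside a `for` loop using indentation alone. The function name `split_by_pivot` and its inputs are made up for this example.

```python
def split_by_pivot(values, pivot):
  """Split values into those below the pivot and the rest."""
  less = []
  greater = []
  for x in values:        # the colon opens an indented block
    if x < pivot:         # a nested block is indented one level further
      less.append(x)
    else:
      greater.append(x)
  return less, greater    # dedenting ends the loop body


less, greater = split_by_pivot([5, 1, 8, 3], pivot=4)
print(less, greater)
```

The amount of indentation is a matter of convention; what matters is that every statement in a block uses the same amount.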

**Variables and argument passing**

When **assigning a variable** (or name) in Python, you are **creating a reference to the object** shown on the righthand side of the equals sign. In practical terms, consider a list of integers:


In [None]:
a = [1, 2, 3]

In [None]:
b = a 

In [None]:
b

[1, 2, 3]

In some languages, the assignment of `b` will cause the data `[1, 2, 3]` to be **copied.**

In **Python**, `a` and `b` actually **now refer to the same object**, the original list `[1, 2, 3]`.

In [None]:
a.append(4)

In [None]:
b

[1, 2, 3, 4]

When you pass objects as arguments to a **function**, new **local variables are created** referencing the original objects without any copying. If you bind a new object to a variable inside a function, that will not overwrite a variable of the same name in the "scope" outside of the function (the "parent scope"). It is therefore possible to alter the internals of a mutable argument. 

In [None]:
def append_element(some_list, element):
  some_list.append(element)

In [None]:
data = [1, 2, 3]
append_element(data, 4)

In [None]:
data

[1, 2, 3, 4]
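
The difference between rebinding a local name and mutating the object it refers to can be sketched as follows. The function names `rebind` and `mutate` are hypothetical and exist only for this illustration.

```python
def rebind(some_list):
  # Binding a new object to the local name leaves the caller's list untouched
  some_list = [0, 0, 0]
  return some_list


def mutate(some_list):
  # Mutating the object is visible through every reference to it
  some_list.append(4)


data = [1, 2, 3]
new_list = rebind(data)
print(data)      # unchanged by rebind
mutate(data)
print(data)      # changed by mutate
```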

**Dynamic references, strong types**

Variables in Python have **no inherent type** associated with them; a variable can refer to a different type of object simply by doing an assignment. 

In [None]:
a = 5

In [None]:
type(a)

int

In [None]:
a = "foo"

In [None]:
type(a)

str

Python is a **strongly typed language**, which means that **every object has a specific type (or class),** and implicit conversions will occur only in certain permitted circumstances.

In [None]:
a = 4.5
b = 2

In [None]:
# String formatting, to be visited later
print(f"a is {type(a)}, b is {type(b)}")

a is <class 'float'>, b is <class 'int'>


In [None]:
a / b

2.25
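
To see where the line is drawn, here is a minimal sketch contrasting a conversion Python refuses to do implicitly with one it permits. The variable names are chosen for this example only.

```python
try:
  "5" + 5                  # str + int is not an implicit conversion Python allows
except TypeError as exc:
  message = str(exc)
  print(message)

total = 4.5 + 2            # the int is promoted to float in arithmetic
print(total, type(total))

converted = int("5") + 5   # an explicit conversion makes the intent clear
print(converted)
```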

**Knowing the type of an object is important**, and it's useful to be able to write functions that can handle many different kinds of input. You can check that an object is an instance of a particular type using the `isinstance` function:

In [None]:
a = 5

isinstance(a, int)

True

`isinstance` can accept a tuple of types if you want to check that an object's type is among those present in the tuple:

In [None]:
a = 5
b = 4.5

In [None]:
isinstance(a, (int, float))

True

In [None]:
isinstance(b, (int, float))

True

**Attributes and methods**

Objects in Python typically have both: 

* **attributes** (other Python objects stored "inside" the object) 
* **methods** (functions associated with an object that can have access to the object's internal data). 

Both of them are accessed via the syntax `obj.attribute_name`. Attributes and methods can also be accessed by name via the `getattr` function:

In [None]:
a = "foo"

In [None]:
getattr(a, "split")

<function str.split(sep=None, maxsplit=-1)>

**Duck typing**

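Often you may **not care about the type of an object** but rather only whether it has **certain methods or behavior**. This is sometimes called duck typing, after the saying "If it walks like a duck and quacks like a duck, then it's a duck." For example, you can verify that an object is iterable by checking whether `iter` accepts it: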

In [None]:
def isiterable(obj):
  try:
    iter(obj)
    return True
  except TypeError: # not iterable
    return False

In [None]:
isiterable("a string")

True

In [None]:
isiterable([1, 2, 3])

True

In [None]:
isiterable(5)

False
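
One place a check like this is useful is in functions that accept several kinds of input. The sketch below repeats `isiterable` so it runs on its own; `as_list` is a hypothetical helper added for illustration.

```python
def isiterable(obj):
  try:
    iter(obj)
    return True
  except TypeError:  # not iterable
    return False


def as_list(x):
  # Hypothetical helper: leave lists alone, convert any other iterable
  if not isinstance(x, list) and isiterable(x):
    x = list(x)
  return x


print(as_list((1, 2, 3)))
print(as_list("ab"))
print(as_list(5))
```

Non-iterable values such as `5` pass through unchanged, so the caller decides how to handle them.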

**Imports**

In Python, a module is simply a file with the `.py` extension containing Python code. Suppose we had the following module:

In [None]:
# some_module.py
PI = 3.14159

def f(x):
  return x + 2

def g(a, b):
  return a + b

If we wanted to access the variables and functions defined in `some_module.py` from another file in the same directory, we could do:

In [None]:
import some_module

result = some_module.f(5)
pi = some_module.PI

In [None]:
from some_module import g, PI

result = g(5, PI)

In [None]:
import some_module as sm
from some_module import PI as pi, g as gf 

r1 = sm.f(pi)
r2 = gf(6, pi)

**Binary operators and comparisons**

In [None]:
5 - 7

-2

In [None]:
12 + 21.5

33.5

In [None]:
5 <= 2

False

<table>
  <tr>
    <th colspan="2"><h4><b>Binary Operators</b></h4></th>
  </tr>
  <tr>
    <th>Operation</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>a + b</td>
    <td>Add a and b
</td>
  </tr>
  <tr>
    <td>a - b</td>
    <td>Subtract b from a</td>
  </tr>
  <tr>
    <td>a * b</td>
    <td>Multiply a by b</td>
  </tr>
  <tr>
    <td>a / b</td>
    <td>Divide a by b</td>
  </tr>
  <tr>
    <td>a // b</td>
    <td>Floor-divide a by b, dropping any fractional remainder</td>
  </tr>
  <tr>
    <td>a ** b </td>
    <td>Raise a to the b power</td>
  </tr>
  <tr>
    <td>a & b</td>
    <td>True if both a and b are True; for integers, take the bitwise AND</td>
  </tr>
  <tr>
    <td>a | b</td>
    <td>True if either a or b is True; for integers, take the bitwise OR</td>
  </tr>
  <tr>
    <td>a ^ b</td>
    <td>For Booleans, True if a or b is True, but not both; for integers, take the bitwise EXCLUSIVE-OR</td>
  </tr>
  <tr>
    <td>a == b</td>
    <td>True if a equals b</td>
  </tr>
  <tr>
    <td>a != b</td>
    <td>True if a is not equal to b</td>
  </tr>
  <tr>
    <td>a < b, a <= b</td>
    <td>True if a is less than (less than or equal to) b</td>
  </tr>
  <tr>
    <td>a > b, a >= b</td>
    <td>True if a is greater than (greater than or equal to) b</td>
  </tr>
  <tr>
    <td>a is b </td>
    <td>True if a and b reference the same Python object</td>
  </tr>
  <tr>
    <td>a is not b</td>
    <td>True if a and b reference different Python objects</td>
  </tr>
</table>

Use the `is` keyword to check whether two variables refer to the same object, and `is not` to check that two objects are not the same:

In [None]:
a = [1, 2, 3]
b = a 
c = list(a)

In [None]:
a is b

True

In [None]:
a is not c

True

Because the `list` function always creates a new Python list (i.e., a copy), we can be sure that `c` is distinct from `a`. Comparing with `is` is not the same as using the `==` operator, which compares values:

In [None]:
a == c

True

A common use of `is` and `is not` is to check if a variable is `None`, since there is only one instance of `None`:

In [None]:
a = None

In [None]:
a is None

True

**Mutable and immutable objects**

* **Mutable** objects: the objects or values they contain **can be modified**. Examples are lists, dictionaries, NumPy arrays, and most user-defined types (classes).

* **Immutable** objects: their internal data **cannot be changed**. Examples are tuples and strings.

In [None]:
a_list = ["fool", 2, [4, 5]]
a_list[2] = (3, 4)
a_list

['fool', 2, (3, 4)]

In [None]:
a_tuple = (3, 5, (4, 5))

try:
  a_tuple[1] = "four"
except TypeError:   # tuples do not support item assignment
  print("error")

error


**Scalar Types**

Python has a small set of built-in types for handling numerical data, strings, Boolean (True or False) values, and dates and times. These "single value" types are sometimes called scalar types, and we refer to them as scalars.

<table>
  <tr>
    <th colspan="2"><h4><b>Standard Python scalar types</b></h4></th>
  </tr>
  <tr>
    <th>Type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>None</td>
    <td>The Python “null” value (only one instance of the None object exists)</td>
  </tr>
  <tr>
    <td>str</td>
    <td>String type; holds Unicode strings
</td>
  </tr>
  <tr>
    <td>bytes</td>
    <td>Raw binary data
</td>
  </tr>
  <tr>
    <td>float</td>
    <td>Double-precision floating-point number (note there is no separate double type)
</td>
  </tr>
  <tr>
    <td>bool</td>
    <td>A Boolean True or False value</td>
  </tr>
  <tr>
    <td>int</td>
    <td>Arbitrary precision integer</td>
  </tr>
</table>

**Numeric types**

The primary Python types for numbers are `int` and `float`. 

* `int` can store arbitrarily large numbers.
* `float` represents a double-precision floating-point number.

In [None]:
ival = 123456
ival ** 6

3540570200530940541182574329856

In [None]:
fval = 7.25
fval2 = 6.78e-5

In [None]:
3/2

1.5

In [None]:
9//2

4

**Strings**

You can write string literals using either single quotes `'` or double quotes `"` (double quotes are generally favored):

In [None]:
a = 'one way of writing a string'
b = "another way"

For multi-line  strings with line breaks, you can use triple quotes, `'''` or `"""`:

In [None]:
c = """
This is a longer string that 
spans multiple lines
"""

In [None]:
c.count("\n")

3

Python strings are immutable; you cannot modify a string in place:

In [None]:
a = "this is a string"

try: 
  a[10] = "f"
except TypeError:   # strings do not support item assignment
  print("error")

error


In [None]:
b = a.replace("string", "longer string")
b 

'this is a longer string'

In [None]:
a 

'this is a string'

In [None]:
a = 5.6

s = str(a)
print(s, type(s))

5.6 <class 'str'>


In [None]:
s = "python"

list(s)

['p', 'y', 't', 'h', 'o', 'n']

In [None]:
s[:3]

'pyt'

In [None]:
s = "12\\34"

print(s)

12\34


In [None]:
# r: raw string
s = r"this\has\no\special\characters"

s

'this\\has\\no\\special\\characters'

In [None]:
# concatenates two strings
a = "this is the first half"
b = "and this is the second half"

a + b 

'this is the first halfand this is the second half'

In [None]:
template = "{0:.2f} {1:s} are worth US${2:d}"

# {0:.2f} means to format the first argument as a floating-point number with two decimal places
# {1:s} means to format the second argument as a string.
# {2:d} means to format the third argument as an exact integer

template.format(88.46, "Argentine Pesos", 1)

'88.46 Argentine Pesos are worth US$1'

In [None]:
amount = 10
rate = 88.46
currency = "Pesos"

result = f"{amount} {currency} is worth US${amount / rate}"
result

'10 Pesos is worth US$0.11304544426859599'

In [None]:
f"{amount} {currency} is worth US${amount/ rate:.2f}"

'10 Pesos is worth US$0.11'

**Bytes and Unicode**

In [None]:
val = "español"

val 

'español'

In [None]:
val_utf8 = val.encode("utf-8")

val_utf8

b'espa\xc3\xb1ol'

In [None]:
type(val_utf8)

bytes

In [None]:
val_utf8.decode("utf-8")

'español'

While it is now preferable to use UTF-8 for any encoding, for historical reasons you may encounter data in any number of different encodings:

In [None]:
val.encode("latin1")

b'espa\xf1ol'

In [None]:
val.encode("utf-16")

b'\xff\xfee\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'

In [None]:
val.encode("utf-16le")

b'e\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'
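
Because the same text produces different bytes under different encodings, decoding with the wrong codec can fail. A minimal sketch of that situation follows; the variable names `latin1_bytes` and `decoded_ok` are chosen for this example.

```python
val = "español"
latin1_bytes = val.encode("latin1")

try:
  latin1_bytes.decode("utf-8")     # wrong codec for these bytes
  decoded_ok = True
except UnicodeDecodeError:
  decoded_ok = False

print(decoded_ok)
print(latin1_bytes.decode("latin1"))   # the matching codec round-trips
```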

**Booleans**

The two Boolean values in Python are written as `True` and `False`. Comparisons and other conditional expressions evaluate to either `True` or `False`. Boolean values are combined with the `and` and `or` keywords:

In [None]:
True and True

True

In [None]:
False or True

True

In [None]:
int(False)

0

In [None]:
int(True)

1

In [None]:
a = True 
b = False

In [None]:
not a 

False

In [None]:
not b

True

**Type casting**

The `str`, `bool`, `int` and `float` types are also functions that can be used to cast values to those types:


In [None]:
s = "3.141459"

fval = float(s)
type(fval)

float

In [None]:
int(fval)

3

In [None]:
bool(fval)

True

In [None]:
bool(0)

False

**None**

`None` is the Python null value type:

In [None]:
a = None

a is None

True

In [None]:
b = 5

b is not None

True

In [None]:
def add_and_maybe_multiply(a, b, c=None):
  result = a + b

  if c is not None:
    result = result * c 
  
  return result
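
A short usage sketch shows why the function tests `c is not None` rather than the truthiness of `c`. The function is repeated here so the example runs on its own; the argument values are arbitrary.

```python
def add_and_maybe_multiply(a, b, c=None):
  result = a + b

  if c is not None:
    result = result * c

  return result


print(add_and_maybe_multiply(2, 3))      # c omitted, so no multiplication
print(add_and_maybe_multiply(2, 3, 4))
print(add_and_maybe_multiply(2, 3, 0))   # 0 is falsy but is not None
```

With a plain `if c:` test, the last call would skip the multiplication and return 5 instead of 0.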

**Dates and times**

The built-in Python `datetime` module provides `datetime`, `date`, and `time` types. The `datetime` type combines the information stored in `date` and `time` and is the most commonly used:

In [None]:
from datetime import datetime, date, time

dt = datetime(2023, 3, 8, 19, 59, 30) # year, month, day, hour, minute, second

In [None]:
dt.month

3

In [None]:
dt.second

30

In [None]:
dt.date()

datetime.date(2023, 3, 8)

In [None]:
dt.time()

datetime.time(19, 59, 30)

In [None]:
dt.strftime("%Y-%m-%d %H:%M")

'2023-03-08 19:59'

In [None]:
dt_hour = dt.replace(minute=0, second=0)

dt_hour

datetime.datetime(2023, 3, 8, 19, 0)

In [None]:
dt

datetime.datetime(2023, 3, 8, 19, 59, 30)

In [None]:
dt2 = datetime(2023, 4, 15, 19, 59, 30)

delta = dt2 - dt
delta

datetime.timedelta(days=38)

In [None]:
type(delta)

datetime.timedelta

In [None]:
dt 

datetime.datetime(2023, 3, 8, 19, 59, 30)

In [None]:
dt + delta

datetime.datetime(2023, 4, 15, 19, 59, 30)
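
Going the other way from `strftime`, a string can be parsed into a `datetime` with `datetime.strptime`, using the same format codes. The sketch below also builds a `timedelta` directly; the date strings are arbitrary examples.

```python
from datetime import datetime, timedelta

parsed = datetime.strptime("2023-03-08 19:59", "%Y-%m-%d %H:%M")
print(parsed)

# A timedelta can also be constructed directly instead of by subtraction
later = parsed + timedelta(days=38)
print(later.strftime("%Y-%m-%d %H:%M"))
```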

**Control Flow**

Python has several built-in keywords for **conditional logic**, **loops**, and **other standard control flow concepts** found in other programming languages.

**if, elif, and else**

In [None]:
x = -5 

if x < 0:
  print("It's negative")

It's negative


In [None]:
if x < 0:
  print("It's negative")
elif x == 0:
  print("Equal to zero")
elif 0 < x < 5:
  print("Positive but smaller than 5")
else:
  print("Positive and larger than or equal to 5")

It's negative


In [None]:
a = 5; b = 7
c = 8; d = 4

if a < b or c > d:
  print("Made it")

Made it
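
Compound conditions with `and` or `or` are evaluated left to right and stop as soon as the result is known. The sketch below makes that visible with a hypothetical `check` helper that records which conditions were evaluated, and also shows a chained comparison.

```python
calls = []

def check(label, result):
  # Hypothetical helper that records which conditions were evaluated
  calls.append(label)
  return result

a = 5; b = 7
c = 8; d = 4

if check("a < b", a < b) or check("c > d", c > d):
  print("Made it")

print(calls)          # the second condition was never evaluated

chained = 4 > 3 > 2 > 1
print(chained)
```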


**for loops**

`for` loops are for iterating over a collection (like a list or tuple) or an iterator.

In [None]:
sequence = [1, 2, None, 4, None, 5]
total = 0

for value in sequence:
  if value is None:
    continue
  total += value

print(total)

12


In [None]:
sequence = [1, 2, 0, 4, 6, 5, 2, 1]
total_until_5 = 0

for value in sequence:
  if value == 5:
    break 
  total_until_5 += value

print(total_until_5)

13


In [None]:
for i in range(4):
  for j in range(4):
    if j > i:
      break
    print((i, j))

(0, 0)
(1, 0)
(1, 1)
(2, 0)
(2, 1)
(2, 2)
(3, 0)
(3, 1)
(3, 2)
(3, 3)


**while loops**

A `while` loop specifies a condition and a block of code that is to be executed until the condition evaluates to `False` or the loop is explicitly ended with `break`:

In [None]:
x = 256
total = 0

while x > 0:
  if total > 500:
    break
  total += x
  x = x // 2
  print(total, x)


256 128
384 64
448 32
480 16
496 8
504 4


**pass**

`pass` is the **"no-op" (or "do nothing") statement** in Python. It can be used in blocks where no action is to be taken (or as a placeholder for code not yet implemented); it is required only because Python uses whitespace to delimit blocks:


In [None]:
if x < 0:
  print("negative!")
elif x == 0:
  # TODO: put something smart here
  pass
else:
  print("positive!")

positive!


**range**

The `range` function generates a sequence of evenly spaced integers:

In [None]:
range(10)

range(0, 10)

In [None]:
list(range(10))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [None]:
list(range(0, 20, 2))      # start, stop, step

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

In [None]:
list(range(5, 0, -1))

[5, 4, 3, 2, 1]

In [None]:
seq = [1, 2, 3, 4]

for i in range(len(seq)):
  print(f"element {i}: {seq[i]}")

element 0: 1
element 1: 2
element 2: 3
element 3: 4
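
When both the index and the value are needed, the built-in `enumerate` is a common alternative to `range(len(seq))`. A minimal sketch, collecting the lines in a list named `lines` purely for illustration:

```python
seq = [1, 2, 3, 4]

lines = []
for i, value in enumerate(seq):
  lines.append(f"element {i}: {value}")

print("\n".join(lines))
```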


In [None]:
total = 0

for i in range(100_000):
  # % is the modulo operator
  if i % 3 == 0 or i % 5 == 0:   # add every number divisible by 3 or 5
    total += i

print(total)

2333316668
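
The same total can be written with `sum` and a generator expression. The sketch below also illustrates that a `range` object does not store its elements, so a large range is cheap to create; the size `10**9` is an arbitrary choice for this example.

```python
total = sum(i for i in range(100_000) if i % 3 == 0 or i % 5 == 0)
print(total)

# range keeps only start, stop, and step, not the individual numbers
big = range(10**9)
print(len(big), big[-1])
```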


# <center><b>Chapter 3. Built-In Data Structures, Functions, and Files</b></center>

## **3.1 Data Structures and Sequences**

### **Tuple**

A tuple is a fixed-length, immutable sequence of Python objects which, once assigned, cannot be changed. The easiest way to create one is with a comma-separated sequence of values wrapped in parentheses:

In [1]:
tup = (4, 5, 6)
tup

(4, 5, 6)

In [4]:
tup = 4, 5, 6
tup

(4, 5, 6)

In [3]:
tuple([4, 0, 2])

(4, 0, 2)

In [5]:
tup = tuple('string')
tup

('s', 't', 'r', 'i', 'n', 'g')

In [6]:
tup[0]

's'

In [7]:
nested_tup = (4, 5, 6), (7, 8)
nested_tup

((4, 5, 6), (7, 8))

In [8]:
nested_tup[0]

(4, 5, 6)

In [9]:
nested_tup[1]

(7, 8)

In [14]:
tup = tuple(['foo', [1,2], True])

try:
  tup[2] = False
except TypeError:   # tuples do not support item assignment
  print("error")

error


In [15]:
# If an object inside a tuple is mutable, such as a list, you can modify it in place.
tup[1].append(3)

tup

('foo', [1, 2, 3], True)

In [16]:
# concatenate tuples using the + operator to produce longer tuples:
(4, None, 'foo') + (6, 0) + ('bar',)

(4, None, 'foo', 6, 0, 'bar')

In [17]:
# multiplying a tuple by an integer
('foo', 'bar') * 4

('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar')
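
Multiplying a tuple repeats references to the same objects rather than copying them. A minimal sketch, using a mutable list named `inner` as an assumption for illustration:

```python
inner = [1, 2]
repeated = (inner,) * 3

inner.append(3)      # the change shows up in every slot of the tuple
print(repeated)
print(repeated[0] is repeated[2])
```

This makes no visible difference for immutable elements such as strings, but it matters when the repeated element is mutable.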

**Unpacking tuples**

In [18]:
tup = (4, 5, 6)

a, b, c = tup
b

5

In [19]:
tup = 4, 5, (6, 7)

a, b, (c, d) = tup
d

7

In [22]:
# swapping variable names; in many languages it might look like this
# (here a is 4 and b is 5 from the previous cell)

tmp = a
a = b
b = tmp

print(a)
print(b)

5
4


In [23]:
# swap variable names, in Python

a, b = 1, 2

print(f"a --- {a}, b --- {b}")

b, a = a, b

print(f"a --- {a}, b --- {b}")

a --- 1, b --- 2
a --- 2, b --- 1


In [24]:
seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

for a, b, c in seq:
  print(f'a={a}, b={b}, c={c}') 

a=1, b=2, c=3
a=4, b=5, c=6
a=7, b=8, c=9


In [26]:
# "pluck" a few elements from the beginning of a tuple.
# *rest captures the remaining values as a list of arbitrary length

values = 1, 2, 3, 4, 5
a, b, *rest = values

print(f"a --------- {a}")
print(f"b --------- {b}")
print(f"rest ------ {rest}")

a --------- 1
b --------- 2
rest ------ [3, 4, 5]


In [27]:
# by convention, the underscore (_) is used for unwanted variables
a, b, *_ = values

**Tuple methods**

`count` counts the number of occurrences of a value:

In [28]:
a = (1, 2, 2, 2, 2, 3, 4, 2)

a.count(2)

5

### **List**

Lists are variable length and their **contents can be modified** in place; that is, lists are mutable. You can define them using square brackets `[]` or using the `list` type function:

In [29]:
a_list = [2, 3, 7, None]
a_list

[2, 3, 7, None]

In [40]:
tup = ("foo", "bar", "baz")

b_list = list(tup)
b_list

['foo', 'bar', 'baz']

In [41]:
b_list[1] = "peekaboo"
b_list

['foo', 'peekaboo', 'baz']

In [33]:
gen = range(10)
gen 

range(0, 10)

In [34]:
list(gen)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

**Adding and removing elements**

`append` - adds an element to the end of the list

`insert` - inserts an element at a specific location in the list

`pop` - removes and returns the element at a particular index

`remove` - removes an element by value: it locates the first such value and removes it from the list

In [42]:
b_list.append("dwarf")
b_list

['foo', 'peekaboo', 'baz', 'dwarf']

In [43]:
b_list.insert(1, "red")
b_list

['foo', 'red', 'peekaboo', 'baz', 'dwarf']

In [44]:
b_list.pop(2)
b_list

['foo', 'red', 'baz', 'dwarf']

In [45]:
b_list.append("foo")
b_list

['foo', 'red', 'baz', 'dwarf', 'foo']

In [46]:
b_list.remove("foo")
b_list

['red', 'baz', 'dwarf', 'foo']

In [47]:
"dwarf" in b_list

True

In [48]:
"dwarf" not in b_list

False

**Concatenating and combining lists**

Concatenating lists by addition is a comparatively **expensive operation**, since a new list must be created and the objects copied over. Using `extend` to append elements to an existing list is usually preferable, especially if you are building up a large list.

In [49]:
[4, None, "foo"] + [7, 8, (2, 3)]

[4, None, 'foo', 7, 8, (2, 3)]

In [50]:
x = [4, None, "foo"]

x.extend([7, 8, (2, 3)])
x

[4, None, 'foo', 7, 8, (2, 3)]

In [54]:
list_of_lists = [[0], [1, 1], [2, 2, 2]]

everything = []
for chunk in list_of_lists:
  everything.extend(chunk)

In [55]:
# slower concatenative alternative: each + creates a new list
everything = []
for chunk in list_of_lists:
  everything = everything + chunk
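
The reason `extend` is cheaper can be seen by checking object identity: `extend` modifies the existing list, while `+` builds a new one. A minimal sketch with variable names chosen for this example:

```python
x = [4, None, "foo"]
alias = x

x.extend([7, 8])          # modifies the existing list in place
same_after_extend = alias is x

y = x + [(2, 3)]          # builds a brand-new list
same_after_plus = y is x

print(same_after_extend, same_after_plus, len(alias), len(y))
```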

**Sorting**

In [57]:
a = [7, 2, 5, 1, 3]

a.sort()
a

[1, 2, 3, 5, 7]

In [58]:
b = ["saw", "small", "He", "foxes", "six"]

b.sort(key=len)
b

['He', 'saw', 'six', 'small', 'foxes']
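
The `sort` method reorders a list in place. The built-in `sorted` function takes the same `key` argument but returns a new list and leaves the original untouched. A short sketch, with variable names chosen for illustration:

```python
words = ["saw", "small", "He", "foxes", "six"]

by_length = sorted(words, key=len)                     # returns a new list
alphabetical = sorted(words, key=str.lower)            # case-insensitive order
longest_first = sorted(words, key=len, reverse=True)

print(words)          # the original order is preserved
print(by_length)
print(alphabetical)
print(longest_first)
```

Sorting is stable, so words of equal length keep their original relative order.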

**Slicing**

Slicing **selects sections of most sequence types** using slice notation, which in its basic form consists of `start:stop` passed to the indexing operator `[]`.

The element at the `start` index is included; the element at the `stop` index is not.

A `step` can also be given after a second colon, for example to take every other element.

In [63]:
#     -8 -7 -6 -5 -4 -3 -2 -1
#      0  1  2  3  4  5  6  7
seq = [7, 2, 3, 7, 5, 6, 0, 1]

seq[1: 5]

[2, 3, 7, 5]

In [64]:
seq[3: 5] = [6, 3]

seq

[7, 2, 3, 6, 3, 6, 0, 1]

In [65]:
seq[:5]

[7, 2, 3, 6, 3]

In [66]:
seq[3:]

[6, 3, 6, 0, 1]

In [67]:
seq[-4:]

[3, 6, 0, 1]

In [68]:
seq[-6:-2]

[3, 6, 3, 6]

In [70]:
#    -6   -5   -4   -3   -2   -1            
#     0    1    2    3    4    5
H = ['h', 'e', 'l', 'l', 'o', '!']

In [71]:
H[2: 4]

['l', 'l']

In [73]:
H[-5: -2]

['e', 'l', 'l']

In [74]:
seq[::2]

[7, 3, 3, 0]

In [75]:
# use a step of -1 to reverse a list or tuple
seq[::-1]

[1, 0, 6, 3, 6, 3, 2, 7]
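
A few consequences of the half-open `start:stop` convention, sketched with the same starting list. The replacement value `9` and the bound `100` are arbitrary.

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]

start, stop = 1, 5
piece = seq[start:stop]
print(len(piece) == stop - start)   # the slice holds stop - start elements

# Assigning to a slice may change the length of the list
seq[3:5] = [9]
print(seq)

# Out-of-range slice bounds are clipped rather than raising an error
tail = seq[5:100]
print(tail)
```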

### **Dictionary**

A **dictionary** (called a hash map or associative array in other languages) stores a collection of **key-value pairs**, where key and value are Python objects. Each **key** is **associated with** a **value** so that a value can be conveniently retrieved, inserted, modified, or deleted given a particular key. One approach for creating a dictionary is to use curly braces `{}` and colons to separate keys and values:

In [76]:
empty_dict = {}

d1 = {"a": "some value", "b": [1, 2, 3, 4]}
d1

{'a': 'some value', 'b': [1, 2, 3, 4]}

In [77]:
d1[7] = "an integer"
d1

{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

In [78]:
d1["b"]

[1, 2, 3, 4]

In [79]:
"b" in d1

True

In [82]:
d1["dummy"] = "another value"
d1

{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 'dummy': 'another value'}

In [85]:
d1[5] = "some value"
d1

{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 'dummy': 'another value',
 5: 'some value'}

In [86]:
del d1[5]
d1

{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 'dummy': 'another value'}

In [87]:
ret = d1.pop("dummy")
ret

'another value'

In [88]:
d1

{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

In [89]:
list(d1.keys())

['a', 'b', 7]

In [90]:
list(d1.values())

['some value', [1, 2, 3, 4], 'an integer']

In [91]:
list(d1.items())

[('a', 'some value'), ('b', [1, 2, 3, 4]), (7, 'an integer')]

In [92]:
d1.update({"b": "foo", "c": 12})
d1

{'a': 'some value', 'b': 'foo', 7: 'an integer', 'c': 12}

**Creating dictionaries from sequences**

In [93]:
key_list = ["apple", "banana", "cherry"]
value_list = [10, 20, 30]

mapping = {}
for key, value in zip(key_list, value_list):
  mapping[key] = value

In [94]:
mapping

{'apple': 10, 'banana': 20, 'cherry': 30}

In [95]:
# a dictionary is a collection of 2-tuples, the dict function accepts a list of 2-tuples
tuples = zip(range(5), reversed(range(5)))
tuples

<zip at 0x7f82d751ea80>

In [97]:
mapping = dict(tuples)
mapping

{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}
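The explicit loop and the `dict(tuples)` call above can be combined, since `dict` accepts any iterable of 2-tuples, including a `zip` object. The following is a minimal sketch that reuses the sample lists from this section. It also shows that a `zip` object is a one-shot iterator, which is why the same object cannot be passed to `dict` twice.

```python
key_list = ["apple", "banana", "cherry"]
value_list = [10, 20, 30]

# dict() consumes the (key, value) pairs produced by zip
mapping = dict(zip(key_list, value_list))
print(mapping)

# a zip object can only be consumed once
pairs = zip(range(3), "abc")
first = dict(pairs)
second = dict(pairs)   # pairs is already exhausted, so this is empty
print(first, second)
```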

**Default values**

In [99]:
some_dict = {'apple': 10, 'banana': 20, 'cherry': 30}
default_value = None

if key in some_dict:
  value = some_dict[key]
else: 
  value = default_value

In [105]:
value = some_dict.get('apple', default_value)
value

10
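The lookup above succeeds because `'apple'` is present. The difference between `get`, `pop`, and plain indexing only shows up for a missing key, so here is a small sketch of that case. The key `"durian"` is just an illustrative value that is not in the dictionary.

```python
some_dict = {"apple": 10, "banana": 20, "cherry": 30}

# get returns None (or the supplied default) when the key is missing
missing = some_dict.get("durian")
fallback = some_dict.get("durian", 0)

# pop with a default returns the default instead of raising KeyError
removed = some_dict.pop("durian", "not found")

# plain indexing raises KeyError for a missing key
try:
    some_dict["durian"]
except KeyError as exc:
    error_key = exc.args[0]

print(missing, fallback, removed, error_key)
```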

In [107]:
words = ["apple", "bat", 'bar', 'atom', "book"]
by_letter = {}

for word in words:
  letter = word[0]
  if letter not in by_letter:
    by_letter[letter] = [word]
  else:
    by_letter[letter].append(word)

by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

In [108]:
by_letter = {}

for word in words:
  letter = word[0]
  by_letter.setdefault(letter, []).append(word)

by_letter

{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

In [110]:
from collections import defaultdict

by_letter = defaultdict(list)

for word in words:
  by_letter[word[0]].append(word)

by_letter

defaultdict(list, {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']})

**Valid dictionary key types**

While the values of a dictionary can be any Python object, the keys generally have to be immutable objects like scalar types (int, float, string) or tuples (all the objects in the tuple need to be immutable, too). The technical term here is **hashability**. You can check whether an object is hashable (can be used as a key in a dictionary) with the hash function:

In [113]:
hash?

In [111]:
hash("string")

3384460652441790926

In [112]:
hash((1, 2, (2, 3)))

-9209053662355515447

In [118]:
try:
  hash((1, 2, [2, 3]))
except TypeError:  # the inner list is mutable, so the tuple is unhashable
  print("error")

error


In [119]:
d = {}

d[tuple([1, 2, 3])] = 5
d

{(1, 2, 3): 5}
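The hashability check can be wrapped in a small helper to test several objects at once. `is_hashable` is a hypothetical name introduced here for illustration. It is not a built-in, and it only relies on `hash` raising `TypeError` for unhashable objects.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dictionary key (illustrative helper)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

candidates = ["string", 3.14, (1, 2, (2, 3)), (1, 2, [2, 3]),
              [1, 2], {"a": 1}, frozenset({1, 2})]
results = [is_hashable(obj) for obj in candidates]
print(results)
```

Tuples are hashable only when everything inside them is hashable, while lists and dictionaries never are. A `frozenset` is the immutable, hashable counterpart of a `set`.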

### **Set**

A **set** is an **unordered collection** of **unique elements.** A set can be created in two ways:
* via the `set` function 
* via a set literal with curly braces `{}`

In [120]:
set([2, 2, 2, 1, 3, 3])

{1, 2, 3}

In [121]:
{2, 2, 2, 1, 3, 3}

{1, 2, 3}

In [122]:
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

In [123]:
a.union(b)

{1, 2, 3, 4, 5, 6, 7, 8}

In [124]:
a | b

{1, 2, 3, 4, 5, 6, 7, 8}

In [126]:
a.intersection(b)

{3, 4, 5}

In [127]:
a & b

{3, 4, 5}

<table>
  <tr>
    <th colspan="3"><h4><b>Python set operations</b></h4></th>
  </tr>
  <tr>
    <th>Function</th>
    <th>Alternative syntax</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>a.add(x) </td>
    <td>N/A</td>
    <td>Add element x to set a</td>
  </tr>
  <tr>
    <td>a.clear() </td>
    <td>N/A</td>
    <td>Reset set a to an empty state, discarding all of its elements</td>
  </tr>
  <tr>
    <td>a.remove(x) </td>
    <td>N/A</td>
    <td>Remove element x from set a</td>
  </tr>
  <tr>
    <td>a.pop()</td>
    <td>N/A</td>
    <td>Remove an arbitrary element from set a, raising KeyError if the set is empty</td>
  </tr>
  <tr>
    <td>a.union(b)</td>
    <td>a | b </td>
    <td>All of the unique elements in a and b</td>
  </tr>
  <tr>
    <td>a.update(b)</td>
    <td>a |= b</td>
    <td>Set the contents of a to be the union of the elements in a and b</td>
  </tr>
  <tr>
    <td>a.intersection(b)</td>
    <td>a & b</td>
    <td>All of the elements in both a and b</td>
  </tr>
  <tr>
    <td>a.intersection_update(b) </td>
    <td>a &= b</td>
    <td>Set the contents of a to be the intersection of the elements in a and b</td>
  </tr>
  <tr>
    <td>a.difference(b) </td>
    <td>a - b </td>
    <td>The elements in a that are not in b</td>
  </tr>
  <tr>
    <td>a.difference_update(b)</td>
    <td>a -= b</td>
    <td>Set a to the elements in a that are not in b</td>
  </tr>
  <tr>
    <td>a.symmetric_difference(b)</td>
    <td>a ^ b </td>
    <td>All of the elements in either a or b but not both</td>
  </tr>
  <tr>
    <td>a.symmetric_difference_update(b)</td>
    <td>a ^= b</td>
    <td>Set a to contain the elements in either a or b but not both</td>
  </tr>
  <tr>
    <td>a.issubset(b) </td>
    <td>a &lt;= b</td>
    <td>True if the elements of a are all contained in b</td>
  </tr>
  <tr>
    <td>a.issuperset(b)</td>
    <td>a &gt;= b</td>
    <td>True if the elements of b are all contained in a</td>
  </tr>
  <tr>
    <td>a.isdisjoint(b)</td>
    <td>N/A</td>
    <td>True if a and b have no elements in common</td>
  </tr>
</table>
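Some operations in the table are not exercised by the cells in this section. The sketch below tries difference, symmetric difference, `isdisjoint`, and one in-place variant on two sample sets.

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

only_in_a = a - b              # same as a.difference(b)
in_one_only = a ^ b            # same as a.symmetric_difference(b)
no_overlap = a.isdisjoint(b)   # False, because 3, 4, and 5 are shared

# in-place variants modify the set on the left-hand side
c = a.copy()
c -= b                         # same as c.difference_update(b)

print(only_in_a, in_one_only, no_overlap, c)
```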

In [129]:
a

{1, 2, 3, 4, 5}

In [131]:
b

{3, 4, 5, 6, 7, 8}

In [132]:
c = a.copy()

c |= b
c

{1, 2, 3, 4, 5, 6, 7, 8}

In [133]:
d = a.copy()

d &= b
d

{3, 4, 5}

In [134]:
my_data = [1, 2, 3, 4]

my_set = {tuple(my_data)}
my_set

{(1, 2, 3, 4)}

In [135]:
a_set = {1, 2, 3, 4, 5}

{1, 2, 3}.issubset(a_set)

True

In [136]:
a_set.issuperset({1, 2, 3})

True

In [137]:
{1, 2, 3} == {3, 2, 1}

True

### **Built-In Sequence Functions**

**enumerate**

`enumerate` returns a sequence of `(i, value)` tuples, which removes the need to keep track of the index by hand:

In [140]:
collection = ['apple', 'banana', 'cherry']
index = 0

for value in collection:
  print(index, value)
  index += 1

0 apple
1 banana
2 cherry


In [141]:
for index, value in enumerate(collection):
  print(index, value)

0 apple
1 banana
2 cherry


**sorted**

The `sorted` function returns a new sorted list from the elements of any sequence:

In [142]:
sorted([7, 1, 2, 6, 0, 3, 2])

[0, 1, 2, 2, 3, 6, 7]

In [143]:
sorted("horse race")

[' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']
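`sorted` also accepts the `key` and `reverse` keyword arguments, the same ones used by the `sort` method of lists. A brief sketch with made-up sample words:

```python
words = ["banana", "Cherry", "apple", "fig"]

by_length = sorted(words, key=len)                # shortest first, ties keep input order
case_insensitive = sorted(words, key=str.lower)   # compare lowercased copies
descending = sorted([7, 1, 2, 6, 0, 3, 2], reverse=True)

print(by_length)
print(case_insensitive)
print(descending)
print(words)  # the original list is left unchanged
```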

**zip**

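`zip` "pairs" up the elements of a number of lists, tuples, or other sequences into tuples. It returns a lazy iterator, so wrap it in `list` to see the result. When the inputs differ in length, the shortest one determines how many tuples are produced: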

In [144]:
seq1 = ["foo", "bar", "baz"]
seq2 = ["one", "two", "three"]

zipped = zip(seq1, seq2)
list(zipped)

[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

In [145]:
seq3 = [False, True]

list(zip(seq1, seq2, seq3))

[('foo', 'one', False), ('bar', 'two', True)]

In [146]:
for index, (a, b) in enumerate(zip(seq1, seq2)):
  print(f"{index}: {a}, {b}")

0: foo, one
1: bar, two
2: baz, three
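`zip` can also be used in the other direction. Given a list of tuples, unpacking it with `*` "unzips" the rows back into separate sequences. A minimal sketch using the pairs produced earlier in this section:

```python
pairs = [("foo", "one"), ("bar", "two"), ("baz", "three")]

# zip(*pairs) is equivalent to zip(pairs[0], pairs[1], pairs[2])
firsts, seconds = zip(*pairs)
print(firsts)
print(seconds)
```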


**reversed**

`reversed` iterates over the elements of a sequence in reverse order:

In [147]:
list(reversed(range(10)))

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

### **List, Set, and Dictionary Comprehensions**

List comprehensions are a convenient and widely used Python language feature. They allow you to concisely form a new list by filtering the elements of a collection and transforming the elements that pass the filter, all in one expression. They take this basic form:

    [expr for value in collection if condition]

This is equivalent to the following `for` loop:


    result = []
    for value in collection:
      if condition:
        result.append(expr)

In [149]:
strings = ["a", "as", "bat", "car", "dove", "python"]

[x.upper() for x in strings if len(x) > 2]

['BAT', 'CAR', 'DOVE', 'PYTHON']

A dictionary comprehension looks like this:

    dict_comp = {key-expr: value-expr for value in collection if condition}


A set comprehension looks like the equivalent list comprehension except with curly braces instead of square brackets:

    set_comp = {expr for value in collection if condition}

In [150]:
unique_lengths = {len(x) for x in strings}

unique_lengths

{1, 2, 3, 4, 6}

In [154]:
set(map(len, strings))

{1, 2, 3, 4, 6}

In [155]:
loc_mapping = {value: index for index, value in enumerate(strings)}

loc_mapping

{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}

**Nested list comprehensions**

In [156]:
all_data = [["John", "Emily", "Michael", "Mary", "Steven"],
            ["Maria", "Juan", "Javier", "Natalia", "Pilar"]]

In [157]:
names_of_interest = []

for names in all_data:
  enough_as = [name for name in names if name.count("a") >= 2]  # names containing at least two "a" characters
  names_of_interest.extend(enough_as)

names_of_interest

['Maria', 'Natalia']
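The loop above can be collapsed into a single nested list comprehension. The `for` parts are written in the same order as the nested loops they replace, and the filter condition goes at the end. A sketch with the same data:

```python
all_data = [["John", "Emily", "Michael", "Mary", "Steven"],
            ["Maria", "Juan", "Javier", "Natalia", "Pilar"]]

# outer loop first, inner loop second, filter last
result = [name for names in all_data for name in names
          if name.count("a") >= 2]
print(result)
```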

In [158]:
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

flattened = [x for tup in some_tuples for x in tup]
flattened

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [159]:
flattened = []

for tup in some_tuples:
  for x in tup:
    flattened.append(x)

flattened 

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [160]:
[[x for x in tup] for tup in some_tuples]

[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

## **3.2 Functions**

Functions are the primary and most important method of code organization and reuse in Python. As a rule of thumb, if you anticipate needing to repeat the same or very similar code more than once, it may be worth writing a reusable function. Functions can also help make your code more readable by giving a name to a group of Python statements.

Functions are declared with the `def` keyword. A function contains a block of code with an optional use of the `return` keyword. If execution reaches the end of the function without encountering a `return` statement, `None` is returned automatically.

In [161]:
def my_function(x, y):
  return x + y

In [162]:
my_function(1, 2)

3

In [163]:
result = my_function(1, 2)
result

3

In [164]:
def function_without_return(x):
  print(x)

In [165]:
result = function_without_return("hello!")

hello!


In [166]:
print(result)

None


In [167]:
def my_function2(x, y, z=1.5):
  if z > 1:
    return z * (x + y)
  else:
    return z / (x + y)

In [168]:
my_function2(5, 6, z=0.7)

0.06363636363636363

In [169]:
my_function2(3.14, 7, 3.5)

35.49

In [170]:
my_function2(10, 20)

45.0

**Namespaces, Scope, and Local Functions**

Functions can access variables created inside the function as well as those outside the function in higher (or even global) scopes. An alternative and more descriptive name describing a variable scope in Python is a namespace. Any variables that are assigned within a function by default are assigned to the local namespace. The local namespace is created when the function is called and is immediately populated by the function's arguments. After the function is finished, the local namespace is destroyed.

In [191]:
def func():
  a = []                     # local variable
  for i in range(5):
    a.append(i)

In [193]:
func()
try: 
  a
except NameError:  # a only existed inside func's local namespace
  print("error")

error


In [195]:
a = []                       

def func():
  for i in range(5):
    a.append(i)

In [196]:
func()
a

[0, 1, 2, 3, 4]

In [198]:
a = None

def bind_a_variable():
  global a 
  a = []

bind_a_variable()

In [199]:
print(a)

[]


**Returning Multiple Values**

In [200]:
def f():
  a = 5
  b = 6
  c = 7
  return a, b, c

a, b, c = f()

In [201]:
print(a, type(a))
print(b, type(b))
print(c, type(c))

5 <class 'int'>
6 <class 'int'>
7 <class 'int'>


In [203]:
return_value = f()
print(return_value, type(return_value))

(5, 6, 7) <class 'tuple'>


In [204]:
def f():
  a = 5
  b = 6
  c = 7
  return {"a": a, "b": b, "c": c}

In [206]:
print(f(), type(f()))

{'a': 5, 'b': 6, 'c': 7} <class 'dict'>


**Functions Are Objects**

Since Python functions are objects, many constructs can be easily expressed that are difficult to do in other languages. Suppose we were doing some **data cleaning** and needed to apply a bunch of transformations to the following list of strings:

In [209]:
states = [" Alabama ", "Georgia!", "Georgia", "georgia", "FlOrIda",
          "south carolina##", "West virginia?"]

Anyone who has ever worked with user-submitted survey data has seen messy results like these. Lots of things need to happen to make this list of strings uniform and ready for analysis: stripping whitespace, removing punctuation symbols, and standardizing proper capitalization. One way to do this is to use built-in string methods along with the `re` standard library module for regular expressions:

In [216]:
import re
 
def clean_strings(strings):
  result = []
  for value in strings:
    value = value.strip()                # remove white space
    value = re.sub("[!#?]", "", value)   # replace [!#?] with ""
    value = value.title()                # make title
    result.append(value)                 # append to a list
  return result

In [217]:
clean_strings(states)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']

In [220]:
def remove_punctuation(value):
  return re.sub("[!#?]", "", value)

clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
  result = []
  for value in strings:
    for func in ops:
      value = func(value)
    result.append(value)
  return result

In [222]:
clean_strings(states, clean_ops)

['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South Carolina',
 'West Virginia']

In [221]:
for x in map(remove_punctuation, states):
  print(x)

 Alabama 
Georgia
Georgia
georgia
FlOrIda
south carolina
West virginia


**Anonymous (Lambda) Functions**

In [223]:
def short_function(x):
  return x * 2

In [224]:
equiv_anon = lambda x: x * 2

In [225]:
def apply_to_list(some_list, f):
  return [f(x) for x in some_list]

In [226]:
ints = [4, 0, 1, 5, 6]

apply_to_list(ints, lambda x: x * 2)

[8, 0, 2, 10, 12]

In [227]:
strings = ["foo", "card", "bar", "aaa", "abab"]

strings.sort(key=lambda x: len(set(x)))
strings

['aaa', 'foo', 'abab', 'bar', 'card']

**Generators**

Many objects in Python support iteration, such as over objects in a list or lines in a file. This is accomplished by means of the iterator protocol, a generic way to **make objects iterable**.

In [228]:
some_dict = {"a": 1, "b": 2, "c": 3}

for key in some_dict:
  print(key)

a
b
c


In [229]:
dict_iterator = iter(some_dict)
dict_iterator

<dict_keyiterator at 0x7f82d746ecc0>

In [230]:
list(dict_iterator)

['a', 'b', 'c']

In [231]:
def squares(n=10):
  print(f"Generating squares from 1 to {n ** 2}")
  for i in range(1, n + 1):
    yield i ** 2

In [233]:
gen = squares()
gen

<generator object squares at 0x7f82d7474eb0>

In [234]:
for x in gen:
  print(x, end=" ")

Generating squares from 1 to 100
1 4 9 16 25 36 49 64 81 100 
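Note that the message is printed when the loop starts, not when `squares()` is called. Calling a generator function only creates the generator object, and its body runs lazily as values are requested. The sketch below makes this visible with `next`, using a smaller `n` for brevity.

```python
def squares(n=10):
    print(f"Generating squares from 1 to {n ** 2}")
    for i in range(1, n + 1):
        yield i ** 2

gen = squares(3)       # nothing is printed yet: the body has not started
first = next(gen)      # runs up to the first yield, so the message prints now
rest = list(gen)       # consumes the remaining values
leftover = list(gen)   # an exhausted generator yields nothing more

print(first, rest, leftover)
```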

**Generator expressions**

In [235]:
gen = (x ** 2 for x in range(100))
gen

<generator object <genexpr> at 0x7f82d7491040>

In [236]:
def _make_gen():
  for x in range(100):
    yield x ** 2

gen = _make_gen()

In [237]:
sum(x ** 2 for x in range(100))

328350

In [238]:
dict((i, i ** 2) for i in range(5))

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

**itertools module**

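The standard library `itertools` module has a collection of generators for many common data algorithms. For example, `groupby` takes any sequence and a function, grouping consecutive elements in the sequence by the return value of the function. Because only consecutive elements are grouped, the same key can appear more than once, as `A` does in the output below.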

In [239]:
import itertools 

def first_letter(x):
  return x[0]

names = ["Alan", "Adam", "Wes", "Will", "Albert", "Steven"]

for letter, group in itertools.groupby(names, first_letter):
  print(letter, list(group))  # group is an iterator over one run of names

A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']


<table>
  <tr>
    <th colspan="2"><h4><b>Some useful itertools functions</b></h4></th>
  </tr>
  <tr>
    <th>Function</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>chain(*iterables) </td>
    <td>Generates a sequence by chaining iterators together. Once elements from<br /> the first iterator are exhausted, elements from the next iterator are returned,<br />and so on.</td>
  </tr>
  <tr>
    <td>combinations(iterable, k)</td>
    <td>Generates a sequence of all possible k-tuples of elements in the iterable,<br /> ignoring order and without replacement (see also the companion function<br />combinations_with_replacement).</td>
  </tr>
  <tr>
    <td>permutations(iterable, k)</td>
    <td>Generates a sequence of all possible k-tuples of elements in the iterable,<br />respecting order.</td>
  </tr>
  <tr>
    <td>groupby(iterable[, keyfunc]) </td>
    <td>Generates (key, sub-iterator) for each unique key.</td>
  </tr>
  <tr>
    <td>product(*iterables, repeat=1)</td>
    <td>Generates the Cartesian product of the input iterables as tuples, similar to<br />a nested for loop.</td>
  </tr>
</table>
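The functions in the table can be tried on small inputs, as sketched below. The last part sorts the names by the grouping key before calling `groupby`, which is one common way to get a single group per key, since `groupby` only merges consecutive elements.

```python
import itertools

chained = list(itertools.chain([1, 2], "ab"))
combos = list(itertools.combinations("abc", 2))
perms = list(itertools.permutations("abc", 2))
pairs = list(itertools.product([0, 1], "xy"))

def first_letter(name):
    return name[0]

names = ["Alan", "Adam", "Wes", "Will", "Albert", "Steven"]

# sort by the grouping key first so each letter forms one consecutive run
grouped = {letter: list(group)
           for letter, group in itertools.groupby(sorted(names, key=first_letter),
                                                  first_letter)}

print(chained)
print(combos)
print(perms)
print(pairs)
print(grouped)
```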

**Errors and Exception Handling**

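Handling Python errors or exceptions gracefully is an important part of building robust programs. In data analysis applications, many functions work only on certain kinds of input. For example, Python's `float` function can cast a string to a floating-point number but fails with a `ValueError` on improper input: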

In [240]:
float("1.2345")

1.2345

In [305]:
# float("something")  # raises ValueError: could not convert string to float

In [242]:
def attempt_float(x):
  try:
    return float(x)
  except:
    return x

In [243]:
attempt_float("1.2345")

1.2345

In [244]:
attempt_float("something")

'something'

In [306]:
# float((1, 2))  # raises TypeError, which "except ValueError" does not catch

In [245]:
def attempt_float(x):
  try:
    return float(x)
  except ValueError:
    return x

In [248]:
def attempt_float(x):
  try:
    return float(x)
  except (TypeError, ValueError):
    return x

In [249]:
attempt_float((1, 2))

(1, 2)
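A `try` statement can also carry `else` and `finally` blocks. `else` runs only when the `try` block raised nothing, and `finally` runs in every case, which makes it a natural place for cleanup. The sketch below uses a hypothetical helper, `parse_numbers`, and in-memory `io.StringIO` objects standing in for real files.

```python
import io

def parse_numbers(handle):
    """Read one float per line from a file-like object (illustrative helper)."""
    values = []
    try:
        for line in handle:
            values.append(float(line))
    except ValueError:
        outcome = "failed"       # runs only if float() raised
    else:
        outcome = "succeeded"    # runs only if the try block raised nothing
    finally:
        handle.close()           # runs in both cases
    return values, outcome

good = io.StringIO("1.5\n2\n")
bad = io.StringIO("1.5\nabc\n")

good_result = parse_numbers(good)
bad_result = parse_numbers(bad)

print(good_result, good.closed)
print(bad_result, bad.closed)
```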

## **3.3 Files and the Operating System**

In [253]:
url = "https://raw.githubusercontent.com/wesm/pydata-book/3rd-edition/examples/segismundo.txt"

In [254]:
import requests

response = requests.get(url)

if response.status_code == 200:
    # Write the contents of the text file to a local file
    with open("segismundo.txt", "w", encoding="utf-8") as file:
        file.write(response.text)
else:
    print("Error fetching file:", response.status_code)

In [265]:
path = "./segismundo.txt"

In [None]:
f = open(path, encoding="utf-8")

for line in f:
  print(line)

In [269]:
lines = [x.rstrip() for x in open(path, encoding="utf-8")]
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']

In [270]:
f.close()

In [268]:
with open(path, encoding="utf-8") as f:
  lines = [x.rstrip() for x in f]
  
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']

<table>
  <tr>
    <th colspan="2"><h4><b>Python file modes</b></h4></th>
  </tr>
  <tr>
    <th>Mode</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>r</td>
    <td>Read-only mode</td>
  </tr>
  <tr>
    <td>w</td>
    <td>Write-only mode; creates a new file (erasing the data for any file with the same name)</td>
  </tr>
  <tr>
    <td>x</td>
    <td>Write-only mode; creates a new file but fails if the file path already exists</td>
  </tr>
  <tr>
    <td>a</td>
    <td>Append to existing file (creates the file if it does not already exist)</td>
  </tr>
  <tr>
    <td>r+</td>
    <td>Read and write</td>
  </tr>
  <tr>
    <td>b</td>
    <td>Add to mode for binary files (i.e., "rb" or "wb")</td>
  </tr>
  <tr>
    <td>t</td>
    <td>Text mode for files (automatically decoding bytes to Unicode); this is the default if not specified</td>
  </tr>
</table>
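The differences between the write modes are easiest to see side by side. The sketch below works inside a temporary directory so that nothing is left behind. The file name `demo.txt` is an arbitrary choice for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")

    with open(demo_path, "x", encoding="utf-8") as f:   # creates the file
        f.write("first\n")

    x_failed = False
    try:
        open(demo_path, "x")                            # the path now exists
    except FileExistsError:
        x_failed = True

    with open(demo_path, "a", encoding="utf-8") as f:   # keeps existing data
        f.write("second\n")
    with open(demo_path, encoding="utf-8") as f:
        after_append = f.read()

    with open(demo_path, "w", encoding="utf-8") as f:   # erases existing data
        f.write("third\n")
    with open(demo_path, encoding="utf-8") as f:
        after_write = f.read()

print(x_failed)
print(repr(after_append))
print(repr(after_write))
```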

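`read` returns a certain number of characters from the file. What constitutes a "character" is determined by the file encoding, or it is simply raw bytes if the file is opened in binary mode.

`tell` gives the current position.

`seek` changes the file position to the indicated byte in the file.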

In [271]:
f1 = open(path)

In [272]:
f1.read(10)

'Sueña el r'

In [273]:
f2 = open(path, mode="rb")
f2.read(10)

b'Sue\xc3\xb1a el '

In [274]:
f1.tell()

11

In [275]:
f2.tell()

10

In [276]:
import sys

sys.getdefaultencoding()

'utf-8'

In [277]:
f1.seek(3)

3

In [278]:
f1.read(1)

'ñ'

In [279]:
f1.tell()

5

In [280]:
f1.close()
f2.close()

In [281]:
path

'./segismundo.txt'

In [282]:
with open("tmp.txt", mode="w") as handle:
  handle.writelines(x for x in open(path) if len(x) > 1)

In [307]:
with open("tmp.txt") as f:
  lines = f.readlines()

lines

['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n']

<table>
  <tr>
    <th colspan="2"><h4><b>Important Python file methods or attributes</b></h4></th>
  </tr>
  <tr>
    <th>Method/attribute</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>read([size])</td>
    <td>Return data from file as bytes or string depending on the file mode, with optional size<br />argument indicating the number of bytes or string characters to read</td>
  </tr>
  <tr>
    <td>readable()</td>
    <td>Return True if the file supports read operations</td>
  </tr>
  <tr>
    <td>readlines([size])</td>
    <td>Return list of lines in the file, with optional size argument</td>
  </tr>
  <tr>
    <td>write(string)</td>
    <td>Write passed string to file</td>
  </tr>
  <tr>
    <td>writable()</td>
    <td>Return True if the file supports write operations</td>
  </tr>
  <tr>
    <td>writelines(strings)</td>
    <td>Write passed sequence of strings to the file</td>
  </tr>
  <tr>
    <td>close()</td>
    <td>Close the file object</td>
  </tr>
  <tr>
    <td>flush()</td>
    <td>Flush the internal I/O buffer to disk</td>
  </tr>
  <tr>
    <td>seek(pos)</td>
    <td>Move to indicated file position (integer)</td>
  </tr>
  <tr>
    <td>seekable()</td>
    <td>Return True if the file object supports seeking and thus random access (some file-like objects do not)</td>
  </tr>
  <tr>
    <td>tell()</td>
    <td>Return current file position as integer</td>
  </tr>
  <tr>
    <td>closed</td>
    <td>True if the file is closed</td>
  </tr>
  <tr>
    <td>encoding</td>
    <td>The encoding used to interpret bytes in the file as Unicode (typically UTF-8)</td>
  </tr>
</table>
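A few of the methods and attributes in the table can be checked on a throwaway file, as in the sketch below. The temporary directory and the file name `methods.txt` are assumptions made only for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "methods.txt")

    with open(demo_path, "w", encoding="utf-8") as f:
        can_read, can_write = f.readable(), f.writable()
        f.writelines(["uno\n", "dos\n", "tres\n"])   # newlines are not added for you

    with open(demo_path, encoding="utf-8") as f:
        first_three = f.read(3)     # the first three characters
        remaining = f.readlines()   # the rest of the file, split into lines
        encoding = f.encoding

    is_closed = f.closed            # the with block has closed the file

print(can_read, can_write)
print(repr(first_three), remaining)
print(encoding, is_closed)
```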

**Bytes and Unicode with Files**

In [289]:
with open(path) as f:
  chars = f.read(10)

chars

'Sueña el r'

In [287]:
len(chars)

10

In [290]:
with open(path, mode="rb") as f:
  data = f.read(10)

data

b'Sue\xc3\xb1a el '

In [291]:
data.decode("utf-8")

'Sueña el '

In [304]:
# data[:4].decode("utf-8")  # raises UnicodeDecodeError: the slice ends in the middle of "ñ"

In [296]:
sink_path = 'sink.text'

with open(path) as source:
  with open(sink_path, "x", encoding="iso-8859-1") as sink:
    sink.write(source.read())

In [297]:
with open(sink_path, encoding="iso-8859-1") as f:
  print(f.read(10))

Sueña el r


In [298]:
f = open(path, encoding='utf-8')

In [299]:
f.read(5)

'Sueña'

In [300]:
f.seek(4)

4

In [302]:
# f.read(1)  # raises UnicodeDecodeError: byte 4 is the second byte of "ñ"

In [303]:
f.close()
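The two commented-out cells in this section fail for the same reason: in UTF-8, "ñ" occupies two bytes, so a byte offset can land in the middle of a character. The sketch below reproduces the effect on an in-memory string, without touching the file.

```python
text = "Sueña"
data = text.encode("utf-8")

n_chars, n_bytes = len(text), len(data)   # 5 characters but 6 bytes
n_tilde = "ñ".encode("utf-8")             # the two bytes that encode "ñ"

# cutting the byte string in the middle of "ñ" leaves an incomplete character
partial_failed = False
try:
    data[:4].decode("utf-8")
except UnicodeDecodeError:
    partial_failed = True

# a slice that ends on a character boundary decodes without error
whole = data[:5].decode("utf-8")

print(n_chars, n_bytes, n_tilde, partial_failed, whole)
```

This is why seeking to an arbitrary byte position in a file opened in text mode with a multi-byte encoding can fail on the next read.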