# Reading Tabular Data into DataFrames

Pandas is a widely-used Python library for statistics, particularly on tabular data.

First time use (Installing via cmd):

`pip install pandas`

Loading:

`import pandas as pd`

In [None]:
import pandas as pd

data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)

+ The columns in a dataframe are the observed variables, and the rows are the observations.
+ Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.

### Using `index_col` to specify which column's values to use as row headings.

In [None]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)

### `DataFrame.info` - Getting more information about the data

In [None]:
data.info()

+ This is a DataFrame
+ Two rows named 'Australia' and 'New Zealand'
+ Twelve columns, each of which has two actual 64-bit floating point values.
+ Uses 208 bytes of memory.

### `DataFrame.columns` - variable that stores information about the dataframe’s columns.

In [None]:
print(data.columns)

### `DataFrame.T` - Is used to transpose a dataframe

+ Sometimes we want to treat columns as rows and vice versa.
+ Transpose (written `.T`) doesn’t copy the data, just changes the program’s view of it.
+ Like `columns`, it is a member variable.

In [None]:
print(data.T)

### `DataFrame.describe` - Is used to get the summary of the data.

In [None]:
print(data.describe())

# Pandas DataFrames/Series

A DataFrame is a collection of Series: the DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the Numpy library, which in practice means that most of the methods defined for Numpy Arrays apply to Pandas Series/DataFrames.
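As a minimal sketch of that relationship, the example below builds a small DataFrame by hand and passes it to ordinary NumPy functions. The row labels, column names, and numbers are invented for illustration and are not taken from the Gapminder files.

```python
import numpy as np
import pandas as pd

# Toy table with invented values, used only for illustration.
toy = pd.DataFrame(
    {"gdp_a": [1.0, 4.0], "gdp_b": [9.0, 16.0]},
    index=["Country X", "Country Y"],
)

# A NumPy universal function applied to a DataFrame returns a DataFrame
# with the same row and column labels.
roots = np.sqrt(toy)
print(roots)

# A single column is a Series; NumPy reductions accept it directly.
col_mean = np.mean(toy["gdp_a"])
print(col_mean)

# The underlying values can be pulled out as a plain NumPy array.
values = toy.to_numpy()
print(type(values), values.shape)
```

The labels survive the call to `np.sqrt`, which is usually what makes working through Pandas more convenient than dropping down to bare arrays.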

What makes Pandas so attractive is the powerful interface to access individual records of the table, proper handling of missing values, and relational-databases operations between DataFrames.

### Selecting values

To access a value at the position `[i,j]` of a DataFrame, we have two options, depending on what is the meaning of `i` in use. Remember that a DataFrame provides an index as a way to identify the rows of the table; a row, then, has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

### `DataFrame.iloc[..., ...]` - Is used to select the values  by their (entry) position.

In [None]:
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.iloc[0, 0])

### `DataFrame.loc[..., ...]`  - Is used to select the values by their (entry) label.

In [None]:
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(data.loc["Albania", "gdpPercap_1952"])
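To make the difference between position and label concrete, here is a small sketch on a hand-made table. The labels `A`, `B`, `C` and the column names are hypothetical stand-ins for country names and GDP columns.

```python
import pandas as pd

# Toy table with invented values (hypothetical, not the Gapminder data).
toy = pd.DataFrame(
    {"y1952": [10.0, 20.0, 30.0], "y1957": [11.0, 21.0, 31.0]},
    index=["A", "B", "C"],
)

by_position = toy.iloc[1, 0]      # second row, first column
by_label = toy.loc["B", "y1952"]  # row labelled "B", column labelled "y1952"
print(by_position, by_label)

# Reordering the rows changes their positions but not their labels.
reordered = toy.sort_index(ascending=False)
first_by_position = reordered.iloc[0, 0]    # now the row labelled "C"
a_by_label = reordered.loc["A", "y1952"]    # still the row labelled "A"
print(first_by_position, a_by_label)
```

Label-based selection keeps pointing at the same entry after sorting or filtering, which is why `loc` tends to be the safer choice when the row order may change.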

### `:` - Use this on its own to mean all columns or all rows.

In [None]:
print(data.loc["Albania", :])

In [None]:
print(data.loc[:, "gdpPercap_1952"])

### Select multiple columns or rows using `DataFrame.loc` and a named slice.

In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])
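One detail worth checking on a small example: a label slice with `loc` includes both end points, whereas a positional slice with `iloc` follows the usual Python rule and excludes the upper bound. The table below is a hypothetical toy, not the Gapminder data.

```python
import pandas as pd

# Toy table with invented labels and values.
toy = pd.DataFrame(
    {"c1": [1, 2, 3, 4], "c2": [5, 6, 7, 8], "c3": [9, 10, 11, 12]},
    index=["w", "x", "y", "z"],
)

# Label slice: row "y" and column "c2" are both included.
label_slice = toy.loc["x":"y", "c1":"c2"]

# Position slice: row 2 and column 1 are both excluded.
position_slice = toy.iloc[1:2, 0:1]

print(label_slice.shape)
print(position_slice.shape)
```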

### Result of slicing can be used in further operations.

In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())


In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())

### Use comparisons to select data based on value.

In [None]:
# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subset)

# Which values were greater than 10000 ?
print('\nWhere are values large?\n', subset > 10000)

### Select values or NaN using a Boolean mask.

A frame full of Booleans is sometimes called a mask because of how it can be used.

In [None]:
mask = subset > 10000
print(subset[mask])

In [None]:
print(subset[subset > 10000].describe())
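The sketch below shows the same masking idea on a tiny hand-made table, so the effect on each cell is easy to follow. The values are invented; the point is that entries where the mask is `False` become `NaN`, and that reductions such as `count()` and `mean()` skip `NaN` by default.

```python
import pandas as pd

# Toy table with invented values.
toy = pd.DataFrame(
    {"a": [5000.0, 12000.0, 15000.0], "b": [11000.0, 8000.0, 20000.0]},
    index=["X", "Y", "Z"],
)

mask = toy > 10000
masked = toy[mask]          # values where the mask is False become NaN
print(masked)

# count() reports only the non-NaN entries in each column.
counts = masked.count()
# mean() is computed over the non-NaN entries only.
means = masked.mean()
print(counts)
print(means)
```

This is why `describe()` on a masked frame summarizes only the values that passed the comparison.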

### Group By: split-apply-combine

Pandas vectorizing methods and grouping operations are features that provide users much flexibility to analyse their data.

In [None]:
mask_higher = data > data.mean()
wealth_score = mask_higher.aggregate('sum', axis=1) / len(data.columns)
wealth_score

In the above code snippet, we wanted a clearer view of how the European countries split themselves according to their GDP.

1. We get a first glance by splitting the countries into two groups for each year surveyed: those with a GDP higher than the European average and those with a lower GDP.
2. We then estimate a wealth score based on the historical values (from 1952 to 2007), counting how many times a country has appeared in the higher-GDP group and dividing by the number of years surveyed.

In [None]:
data.groupby(wealth_score).sum()

In the above code snippet for each group in the wealth_score table, we sum their (financial) contribution across the years surveyed.
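The split-apply-combine steps can be traced by hand on a tiny table. The following is a sketch with invented numbers and hypothetical labels (`A` to `D`, `y1`, `y2`), using the same operations as the cells above.

```python
import pandas as pd

# Toy table: four "countries" and two "years", values invented.
toy = pd.DataFrame(
    {"y1": [1.0, 2.0, 9.0, 10.0], "y2": [1.0, 8.0, 3.0, 12.0]},
    index=["A", "B", "C", "D"],
)

# True where a country is above that year's average (5.5 for y1, 6.0 for y2).
mask_higher = toy > toy.mean()

# Fraction of years in which each country was above average.
score = mask_higher.sum(axis=1) / len(toy.columns)
print(score)

# Rows that share a score form one group; sum() adds them up column by column.
grouped = toy.groupby(score).sum()
print(grouped)
```

Here `A` scores 0.0, `B` and `C` both score 0.5, and `D` scores 1.0, so the grouped table has three rows and the middle one holds the sums for `B` and `C`.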

# Plotting
### `matplotlib` is the most widely used scientific plotting library in Python.

+ Commonly use a sub-library called matplotlib.pyplot.
+ The Jupyter Notebook will render plots inline if we ask it to using a “magic” command.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')

### Plot data directly from a `Pandas dataframe`.

+ We can also plot Pandas dataframes.
+ This implicitly uses `matplotlib.pyplot`.
+ Before plotting, we convert the column headings from a string to integer data type, since they represent numerical values

In [None]:
import pandas as pd

data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')

# The column names have the form 'gdpPercap_(year)'.
# For plotting GDP against year we want to keep only the (year) part.
# Column labels are strings, so the .str accessor gives access to string methods.
# str.replace() removes the exact text 'gdpPercap_'; str.strip() would instead
# treat its argument as a set of characters to trim from both ends.

years = data.columns.str.replace('gdpPercap_', '')

# Convert year values to integers, saving results back to dataframe

data.columns = years.astype(int)

data.loc['Australia'].plot()
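When trimming a prefix such as `gdpPercap_` from column names, it helps to know that `str.strip` and `str.replace` behave differently. The sketch below compares them on a hand-made `Index`; the label `gdpPercap_1952a` is a hypothetical name chosen only to expose the difference.

```python
import pandas as pd

columns = pd.Index(["gdpPercap_1952", "gdpPercap_1957"])

# strip() treats its argument as a set of characters to trim from both ends.
stripped = columns.str.strip("gdpPercap_")
print(list(stripped))

# It gives the year here only because digits are not in that set. A label that
# ends in one of those characters loses it as well (hypothetical label):
odd = "gdpPercap_1952a".strip("gdpPercap_")
print(odd)

# replace() removes the exact prefix text and nothing else.
years = columns.str.replace("gdpPercap_", "", regex=False).astype(int)
print(list(years))
```

For the Gapminder column names both approaches give the same years; `str.replace` is simply less surprising if the naming scheme ever changes.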

### Select and transform data, then plot it.
+ By default, DataFrame.plot plots with the rows as the X axis.
+ We can transpose the data in order to plot multiple series.

In [None]:
data.T.plot()
plt.ylabel('GDP per capita')

In [None]:
# Different styles of plots (Bar Graph)
plt.style.use('ggplot')
data.T.plot(kind='bar')
plt.ylabel('GDP per capita')

In [None]:
# We can plot multiple datasets together

# Select two countries' worth of data.
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']

# Plot with differently-colored markers.
plt.plot(years, gdp_australia, 'b-', label='Australia') # This is making labels
plt.plot(years, gdp_nz, 'g-', label='New Zealand')

# Create legend.
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)')

In [None]:
plt.scatter(gdp_australia, gdp_nz) # Each point pairs Australia's and New Zealand's GDP for one year, showing how the two are related

### Saving your plot to a file

If you are satisfied with the plot you see you may want to save it to a file, perhaps to include it in a publication. There is a function in the matplotlib.pyplot module that accomplishes this: savefig. Calling this function, e.g. with

```python
plt.savefig('my_figure.png')
```

will save the current figure to the file my_figure.png. The file format will automatically be deduced from the file name extension (other formats are pdf, ps, eps and svg).
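As a rough sketch of saving in more than one format, the example below writes the same figure as PNG and PDF. The temporary output directory and the `Agg` backend line are assumptions made so the snippet runs as a plain script without a display; inside a notebook the backend line can be dropped.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; only needed outside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 100, 200, 300])
ax.set_xlabel("Time (hr)")
ax.set_ylabel("Position (km)")

# Hypothetical output location: a temporary directory, to avoid cluttering
# the working directory.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "my_figure.png")
pdf_path = os.path.join(out_dir, "my_figure.pdf")

fig.savefig(png_path, dpi=150)   # format inferred from the ".png" extension
fig.savefig(pdf_path)            # same figure, written as PDF
print(os.path.exists(png_path), os.path.exists(pdf_path))
```

Calling `savefig` on the figure object rather than through `plt` makes it explicit which figure is written when several are open.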

# Analyzing Patient Data

### Loading data into Python

<u> Using inflammation data for this </u>
    
+ We can do that using a library called NumPy, which stands for Numerical Python.
+ To tell Python that we’d like to start using NumPy, we need to import it:

In [None]:
import numpy

numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',') # read the comma-separated values into an array

In [None]:
# Our call to numpy.loadtxt read our file but didn’t save the data in memory.
# To do that, we need to assign the array to a variable.
# In a similar manner to how we assign a single value to a variable,
# we can also assign an array of values to a variable using the same syntax.
# Let’s re-run numpy.loadtxt and save the returned data:

data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')

print(data)

In [None]:
print(type(data))

In [None]:
print(data.dtype)

In [None]:
print(data.shape)

So the data has 60 rows and 40 columns

In [None]:
print('first value in data:', data[0, 0])

In [None]:
print('middle value in data:', data[30, 20])

We can visualize the data like this:

![image1](data/image1.png "Data")

### Slicing data

An index like `[30, 20]` selects a single element of an array, but we can select whole sections as well. For example, we can select the first ten days (columns) of values for the first four patients (rows) like this:

In [None]:
print(data[0:4, 0:10])

The slice `0:4` means, “Start at index 0 and go up to, but not including, index 4”. Again, the up-to-but-not-including takes a bit of getting used to, but the rule is that the difference between the upper and lower bounds is the number of values in the slice.
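The "upper minus lower" rule is easy to verify on a small array. The 6 x 8 array below is a hypothetical stand-in for the inflammation data, filled with consecutive integers so each element is recognizable.

```python
import numpy

# Hypothetical 6 x 8 array holding 0..47, standing in for the real data.
toy = numpy.arange(48).reshape(6, 8)

block = toy[0:4, 0:5]
print(block.shape)      # rows: 4 - 0, columns: 5 - 0

single_row = toy[5:6, 2:7]
print(single_row.shape) # rows: 6 - 5, columns: 7 - 2
print(single_row)
```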

In [None]:
print(data[5:10, 0:10]) # We don't need to start at 0

In [None]:
# We can omit the lower or upper bound; Python then uses the start or end of the axis
small = data[:3, 36:]
print('small is:')
print(small)

### Analyzing data
We can ask NumPy to compute data’s mean value:

In [None]:
print(numpy.mean(data))

In [None]:
# Multiple value assignment
maxval, minval, stdval = numpy.max(data), numpy.min(data), numpy.std(data)

print('maximum inflammation:', maxval)
print('minimum inflammation:', minval)
print('standard deviation:', stdval)

When analyzing data, though, we often want to look at variations in statistical values, such as the maximum inflammation per patient or the average inflammation per day. One way to do this is to create a new temporary array of the data we want, then ask it to do the calculation:

In [None]:
patient_0 = data[0, :] # 0 on the first axis (rows), everything on the second (columns)
print('maximum inflammation for patient 0:', numpy.max(patient_0))

In [None]:
# We don’t actually need to store the row in a variable of its own. Instead, we can combine the selection and the function call:
print('maximum inflammation for patient 2:', numpy.max(data[2, :]))

##### Visualize like this
What if we need the maximum inflammation for each patient over all days (as in the next diagram on the left) or the average for each day (as in the diagram on the right)? As the diagram below shows, we want to perform the operation across an axis:

![image2](data/image2.png "Data")

To support this functionality, most array functions allow us to specify the axis we want to work on. If we ask for the average across axis 0 (rows in our 2D example), we get:

In [None]:
print(numpy.mean(data, axis=0))

In [None]:
# As a quick check, we can ask this array what its shape is:
print(numpy.mean(data, axis=0).shape)

In [None]:
print(numpy.mean(data, axis=1))
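A tiny array makes the meaning of `axis` easier to check by hand. The sketch below assumes a made-up table of 2 patients (rows) by 3 days (columns).

```python
import numpy

# Hypothetical data: 2 patients (rows) by 3 days (columns).
toy = numpy.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 8.0]])

per_day = numpy.mean(toy, axis=0)      # collapses the rows: one value per column
per_patient = numpy.mean(toy, axis=1)  # collapses the columns: one value per row

print(per_day, per_day.shape)
print(per_patient, per_patient.shape)
```

The axis named in the call is the one that disappears from the result: `axis=0` leaves one value per day, and `axis=1` leaves one value per patient.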

# Visualizing Tabular Data

In [None]:
# we will import the pyplot module from matplotlib and use two of its functions to create and display a heat map of our data:
import matplotlib.pyplot
image = matplotlib.pyplot.imshow(data)
matplotlib.pyplot.show()

Blue pixels in this heat map represent low values, while yellow pixels represent high values. As we can see, inflammation rises and falls over a 40-day period. Let’s take a look at the average inflammation over time:

```python
ave_inflammation = numpy.mean(data, axis=0)
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
matplotlib.pyplot.show()
```
<mark> Run the code snippets to see the result</mark> 

Here, we have put the average inflammation per day across all patients in the variable ave_inflammation, then asked matplotlib.pyplot to create and display a line graph of those values. The result is a roughly linear rise and fall, which is suspicious: we might instead expect a sharper rise and slower fall. Let’s have a look at two other statistics:

<b> These are used to get max and min plots </b>
```python
max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
matplotlib.pyplot.show()
```
```python
min_plot = matplotlib.pyplot.plot(numpy.min(data, axis=0))
matplotlib.pyplot.show()
```

#### Grouping plots

The function `matplotlib.pyplot.figure()` creates a space into which we will place all of our plots. The parameter figsize tells Python how big to make this space. Each subplot is placed into the figure using its `add_subplot` method. The `add_subplot` method takes 3 parameters. The first denotes how many total rows of subplots there are, the second parameter refers to the total number of subplot columns, and the final parameter denotes which subplot your variable is referencing (left-to-right, top-to-bottom). Each subplot is stored in a different variable (`axes1`, `axes2`, `axes3`). Once a subplot is created, the axes can be titled using the `set_xlabel()` command (or `set_ylabel()`). Here are our three plots side by side:



In [None]:
import numpy
import matplotlib.pyplot

data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')

fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)

axes1.set_ylabel('average')
axes1.plot(numpy.mean(data, axis=0))

axes2.set_ylabel('max')
axes2.plot(numpy.max(data, axis=0))

axes3.set_ylabel('min')
axes3.plot(numpy.min(data, axis=0))

fig.tight_layout()

matplotlib.pyplot.savefig('inflammation.png')
matplotlib.pyplot.show()


# Analyzing Data from Multiple Files

In [None]:
import glob

The `glob` library contains a function, also called `glob`, that finds files and directories whose names match a pattern. We provide those patterns as strings: the character `*` matches zero or more characters, while `?` matches any one character. We can use this to get the names of all the inflammation CSV files in the `data` directory:

In [None]:
print(glob.glob('data/inflammation*.csv'))

+ As these examples show, glob.glob’s result is a list of file and directory paths in arbitrary order.
+ This means we can loop over it to do something with each filename in turn.
+ In our case, the “something” we want to do is generate a set of plots for each file in our inflammation dataset.
+ If we want to start by analyzing just the first three files in alphabetical order, we can use the sorted built-in function to generate a new sorted list from the glob.glob output:

In [None]:
import glob
import numpy
import matplotlib.pyplot

filenames = sorted(glob.glob('data/inflammation*.csv'))
filenames = filenames[0:3]
for filename in filenames:
    print(filename)

    data = numpy.loadtxt(fname=filename, delimiter=',')

    fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)

    axes1.set_ylabel('average')
    axes1.plot(numpy.mean(data, axis=0))

    axes2.set_ylabel('max')
    axes2.plot(numpy.max(data, axis=0))

    axes3.set_ylabel('min')
    axes3.plot(numpy.min(data, axis=0))

    fig.tight_layout()
    matplotlib.pyplot.show()


# Looping Over Data Sets

#### Use a `for` loop to process files given a list of their names.

In [None]:
import pandas as pd
for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

#### Use `glob.glob` to find sets of files whose names match a pattern.

+ In Unix, the term “globbing” means “matching a set of files with a pattern”.
+ The most common patterns are:
  + `*` meaning “match zero or more characters”
  + `?` meaning “match exactly one character”
+ Python’s standard library contains the `glob` module to provide pattern matching functionality
+ The `glob` module contains a function also called glob to match file patterns
+ E.g., `glob.glob('*.txt')` matches all files in the current directory whose names end with .txt.
+ Result is a (possibly empty) list of character strings.
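The two wildcards can be compared side by side with a few throwaway files. This sketch creates empty files with hypothetical names in a temporary directory, so it does not depend on the lesson's data folder.

```python
import glob
import os
import tempfile

# Create a few empty files in a temporary directory (hypothetical names).
work_dir = tempfile.mkdtemp()
for name in ["run-1.csv", "run-2.csv", "run-10.csv", "notes.txt"]:
    open(os.path.join(work_dir, name), "w").close()

def matches(pattern):
    """Return the sorted base names of files in work_dir matching pattern."""
    paths = glob.glob(os.path.join(work_dir, pattern))
    return sorted(os.path.basename(p) for p in paths)

star = matches("run-*.csv")      # '*' matches any number of characters
question = matches("run-?.csv")  # '?' matches exactly one character
nothing = matches("*.pdb")       # no match gives an empty list, not an error

print(star)
print(question)
print(nothing)
```

Note that `sorted` orders the names as text, so `run-10.csv` comes before `run-2.csv`; zero-padded names such as `run-02.csv` avoid that surprise.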

In [None]:
import glob
print('all csv files in data directory:', glob.glob('data/*.csv'))

In [None]:
print('all PDB files:', glob.glob('*.pdb'))

#### Use `glob` and `for` to process batches of files.

In [None]:
# Helps a lot if the files are named and stored systematically and consistently so that simple patterns will find the right data.
for filename in glob.glob('data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

This includes all data, as well as per-region data.
Use a more specific pattern in the exercises to exclude the whole data set.
But note that the minimum of the entire data set is also the minimum of one of the data sets, which is a nice check on correctness.

# Creating Functions

Let’s start by defining a function `fahr_to_celsius` that converts temperatures from Fahrenheit to Celsius:

In [None]:
def fahr_to_celsius(temp):
    return ((temp - 32) * (5/9))

![image3](data/image3.png "Data")

```python
fahr_to_celsius(32)
```
This command should call our function, using “32” as the input and return the function value.

In fact, calling our own function is no different from calling any other function:

In [None]:
print('freezing point of water:', fahr_to_celsius(32), 'C')
print('boiling point of water:', fahr_to_celsius(212), 'C')

#### Composing Functions

Now that we’ve seen how to turn Fahrenheit into Celsius, we can also write the function to turn Celsius into Kelvin:

In [None]:
def celsius_to_kelvin(temp_c):
    return temp_c + 273.15

print('freezing point of water in Kelvin:', celsius_to_kelvin(0.))
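Composition then means calling one function on the result of another. As a sketch, the function below (named `fahr_to_kelvin` here purely for illustration) chains the two conversions; both helper functions are repeated so the block runs on its own.

```python
def fahr_to_celsius(temp):
    return (temp - 32) * (5 / 9)


def celsius_to_kelvin(temp_c):
    return temp_c + 273.15


def fahr_to_kelvin(temp_f):
    """Convert Fahrenheit to Kelvin by chaining the two functions above."""
    temp_c = fahr_to_celsius(temp_f)
    temp_k = celsius_to_kelvin(temp_c)
    return temp_k


print('boiling point of water in Kelvin:', fahr_to_kelvin(212.0))
```

Because each step is a small function that was already checked on its own, the combined function needs very little new logic.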

#### Tidying up

Now that we know how to wrap bits of code up in functions, we can make our inflammation analysis easier to read and easier to reuse. First, let’s make a visualize function that generates our plots:

```python
def visualize(filename):

    data = numpy.loadtxt(fname=filename, delimiter=',')

    fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)

    axes1.set_ylabel('average')
    axes1.plot(numpy.mean(data, axis=0))

    axes2.set_ylabel('max')
    axes2.plot(numpy.max(data, axis=0))

    axes3.set_ylabel('min')
    axes3.plot(numpy.min(data, axis=0))

    fig.tight_layout()
    matplotlib.pyplot.show()
```

and another function called `detect_problems` that checks for those systematics we noticed:

```python
def detect_problems(filename):

    data = numpy.loadtxt(fname=filename, delimiter=',')

    if numpy.max(data, axis=0)[0] == 0 and numpy.max(data, axis=0)[20] == 20:
        print('Suspicious looking maxima!')
    elif numpy.sum(numpy.min(data, axis=0)) == 0:
        print('Minima add up to zero!')
    else:
        print('Seems OK!')
```

We can reproduce the previous analysis with a much simpler for loop:

```python
filenames = sorted(glob.glob('data/inflammation*.csv'))

for filename in filenames[:3]:
    print(filename)
    visualize(filename)
    detect_problems(filename)
```

### Testing and Documenting

Once we start putting things in functions so that we can re-use them, we need to start testing that those functions are working correctly. To see how to do this, let’s write a function to offset a dataset so that its mean value shifts to a user-defined value:

In [None]:
def offset_mean(data, target_mean_value):
    return (data - numpy.mean(data)) + target_mean_value

In [None]:
z = numpy.zeros((2,2))
print(offset_mean(z, 3))

In [None]:
# That looks right, so let’s try offset_mean on our real data:
data = numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',')
print(offset_mean(data, 0))

It’s hard to tell from the default output whether the result is correct, but there are a few tests that we can run to reassure us:

In [None]:
print('original min, mean, and max are:', numpy.min(data), numpy.mean(data), numpy.max(data))
offset_data = offset_mean(data, 0)
print('min, mean, and max of offset data are:',
      numpy.min(offset_data),
      numpy.mean(offset_data),
      numpy.max(offset_data))

That seems almost right: the original mean was about 6.1, so the lower bound from zero is now about -6.1. The mean of the offset data isn’t quite zero — we’ll explore why not in the challenges — but it’s pretty close. We can even go further and check that the standard deviation hasn’t changed:

In [None]:
print('std dev before and after:', numpy.std(data), numpy.std(offset_data))


Those values look the same, but we probably wouldn’t notice if they were different in the sixth decimal place. Let’s do this instead:

In [None]:
print('difference in standard deviations before and after:',
      numpy.std(data) - numpy.std(offset_data))

Again, the difference is very small. It’s still possible that our function is wrong, but it seems unlikely enough that we should probably get back to doing our analysis. We have one more task first, though: we should write some documentation for our function to remind ourselves later what it’s for and how to use it.
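Before turning to documentation, the manual checks above can also be written as comparisons with a tolerance. The sketch below is one possible way to do that with `numpy.isclose`; the seeded random array is a hypothetical stand-in for the inflammation data, and the tolerance of `1e-9` is an arbitrary choice.

```python
import numpy


def offset_mean(data, target_mean_value):
    return (data - numpy.mean(data)) + target_mean_value


# Hypothetical stand-in for the inflammation data: seeded random values.
rng = numpy.random.default_rng(0)
sample = rng.uniform(0.0, 20.0, size=(60, 40))
shifted = offset_mean(sample, 0)

# Floating-point rounding means the new mean is rarely exactly 0,
# so compare with a tolerance instead of ==.
mean_ok = numpy.isclose(numpy.mean(shifted), 0.0, atol=1e-9)

# Shifting every value by the same amount should leave the spread unchanged.
std_ok = numpy.isclose(numpy.std(sample), numpy.std(shifted))

print(mean_ok, std_ok)
```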

The usual way to put documentation in software is to add comments like this:

In [None]:
# offset_mean(data, target_mean_value):
# return a new array containing the original data with its mean offset to match the desired value.
def offset_mean(data, target_mean_value):
    return (data - numpy.mean(data)) + target_mean_value

There’s a better way, though. If the first thing in a function is a string that isn’t assigned to a variable, that string is attached to the function as its documentation:



In [None]:
def offset_mean(data, target_mean_value):
    """Return a new array containing the original data
       with its mean offset to match the desired value."""
    return (data - numpy.mean(data)) + target_mean_value

This is better because we can now ask Python’s built-in help system to show us the documentation for the function:

In [None]:
help(offset_mean)

A string like this is called a docstring. We don’t need to use triple quotes when we write one, but if we do, we can break the string across multiple lines:

In [None]:
def offset_mean(data, target_mean_value):
    """Return a new array containing the original data
       with its mean offset to match the desired value.

    Examples
    --------
    >>> offset_mean([1, 2, 3], 0)
    array([-1.,  0.,  1.])
    """
    return (data - numpy.mean(data)) + target_mean_value

help(offset_mean)
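The docstring is stored on the function object itself, which is where `help` finds it. A short sketch of reading it directly through the `__doc__` attribute follows; the function `undocumented` is a hypothetical example added for contrast.

```python
import numpy


def offset_mean(data, target_mean_value):
    """Return a new array containing the original data
       with its mean offset to match the desired value."""
    return (data - numpy.mean(data)) + target_mean_value


def undocumented(x):
    return x


# The docstring is available as a plain string attribute.
doc_text = offset_mean.__doc__
print(doc_text)

# A function without a docstring has nothing attached: __doc__ is None.
print(undocumented.__doc__)
```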

# Errors and Exceptions

In [None]:
# This code has an intentional error. You can type it directly or
# use it for reference to understand the error message below.
def favorite_ice_cream():
    ice_creams = [
        'chocolate',
        'vanilla',
        'strawberry'
    ]
    print(ice_creams[3])

favorite_ice_cream()

This particular traceback has two levels. You can determine the number of levels by looking for the number of arrows on the left hand side. In this case:

1. The first shows code from the cell above, with an arrow pointing to Line 11 (which is `favorite_ice_cream()`).

2. The second shows some code in the function `favorite_ice_cream`, with an arrow pointing to Line 9 (which is `print(ice_creams[3])`).

The last level is the actual place where the error occurred. The other level(s) show what function the program executed to get to the next level down. So, in this case, the program first performed a function call to the function `favorite_ice_cream`. Inside this function, the program encountered an error on Line 9, when it tried to run the code `print(ice_creams[3])`.

So what error did the program actually encounter? In the last line of the traceback, Python helpfully tells us the category or type of error (in this case, it is an IndexError) and a more detailed error message (in this case, it says “list index out of range”).

If you encounter an error and don’t know what it means, it is still important to read the traceback closely. That way, if you fix the error, but encounter a new one, you can tell that the error changed. Additionally, sometimes knowing where the error occurred is enough to fix it, even if you don’t entirely understand the message.

If you do encounter an error you don’t recognize, try looking at the official documentation on errors. However, note that you may not always be able to find the error there, as it is possible to create custom errors. In that case, hopefully the custom error message is informative enough to help you figure out what went wrong.
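The error type and message described above can also be inspected from code. The sketch below uses a `try`/`except` block, which is not covered in these notes, to catch the `IndexError` and print its parts; treat it as an optional illustration rather than part of the lesson's workflow.

```python
import traceback


def favorite_ice_cream():
    ice_creams = ['chocolate', 'vanilla', 'strawberry']
    print(ice_creams[3])


try:
    favorite_ice_cream()
except IndexError as err:
    error_type = type(err).__name__   # the category of error
    error_message = str(err)          # the more detailed message
    print('type   :', error_type)
    print('message:', error_message)
    # The formatted traceback lists each level, with the innermost call last.
    trace_text = traceback.format_exc()
    print(trace_text)
```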

### Syntax Errors
When you forget a colon at the end of a line, accidentally add one space too many when indenting under an `if` statement, or forget a parenthesis, you will encounter a syntax error. This means that Python couldn’t figure out how to read your program. This is similar to forgetting punctuation in English: for example, this text is difficult to read there is no punctuation there is also no capitalization why is this hard because you have to figure out where each sentence ends you also have to figure out where each sentence begins to some extent it might be ambiguous if there should be a sentence break or not

People can typically figure out what is meant by text with no punctuation, but people are much smarter than computers. If Python doesn’t know how to read the program, it will give up and inform you with an error. For example:

In [None]:
def some_function()
    msg = 'hello, world!'
    print(msg)
     return msg

Here, Python tells us that there is a `SyntaxError` on line 1, and even puts a little arrow in the place where there is an issue. In this case the problem is that the function definition is missing a colon at the end.

Actually, the function above has two issues with syntax. If we fix the problem with the colon, we see that there is also an `IndentationError`, which means that the lines in the function definition do not all have the same indentation:

In [None]:
def some_function():
    msg = 'hello, world!'
    print(msg)
     return msg

Both `SyntaxError` and `IndentationError` indicate a problem with the syntax of your program, but an `IndentationError` is more specific: it always means that there is a problem with how your code is indented.

Some indentation errors are harder to spot than others. In particular, mixing spaces and tabs can be difficult to spot because they are both whitespace. In the example below, the first two lines in the body of the function `some_function` are indented with tabs, while the third line is indented with spaces. If you’re working in a Jupyter notebook, be sure to copy and paste this example rather than trying to type it in manually because Jupyter automatically replaces tabs with spaces.

In [None]:
def some_function():
	msg = 'hello, world!'
	print(msg)
        return msg

### Variable Name Errors

Another very common type of error is called a NameError, and occurs when you try to use a variable that does not exist. For example:

In [None]:
print(a)

### Index Errors
Next up are errors having to do with containers (like lists and strings) and the items within them. If you try to access an item in a list or a string that does not exist, then you will get an error. This makes sense: if you asked someone what day they would like to get coffee, and they answered “caturday”, you might be a bit annoyed. Python gets similarly annoyed if you try to ask it for an item that doesn’t exist:

In [None]:
letters = ['a', 'b', 'c']
print('Letter #1 is', letters[0])
print('Letter #2 is', letters[1])
print('Letter #3 is', letters[2])
print('Letter #4 is', letters[3])

Here, Python is telling us that there is an IndexError in our code, meaning we tried to access a list index that did not exist.

### File Errors

The last type of error we’ll cover today is the kind associated with reading and writing files: `FileNotFoundError`. If you try to read a file that does not exist, you will receive a `FileNotFoundError` telling you so. If you attempt to write to a file that was opened read-only, Python 3 raises an `UnsupportedOperation` error. More generally, problems with input and output manifest as `IOError`s or `OSError`s, depending on the version of Python you use.

In [None]:
file_handle = open('myfile.txt', 'r')

One reason for receiving this error is that you specified an incorrect path to the file. For example, if I am currently in a folder called `myproject`, and I have a file in `myproject/writing/myfile.txt`, but I try to open `myfile.txt`, this will fail. The correct path would be `writing/myfile.txt`. It is also possible that the file name or its path contains a typo.

A related issue can occur if you use the “read” flag instead of the “write” flag. Python will not give you an error if you try to open a file for writing when the file does not exist. However, if you meant to open a file for reading, but accidentally opened it for writing, and then try to read from it, you will get an `UnsupportedOperation` error telling you that the file was not opened for reading:

In [None]:
file_handle = open('myfile.txt', 'w')
file_handle.read()

These are the most common errors with files, though many others exist. If you get an error that you’ve never seen before, searching the Internet for that error type often reveals common reasons why you might get that error.
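Both file errors can be reproduced safely in a temporary directory. The sketch below assumes Python 3 and a hypothetical file name, and it uses `try`/`except` only to keep the script running after each error.

```python
import io
import os
import tempfile

work_dir = tempfile.mkdtemp()
path = os.path.join(work_dir, 'myfile.txt')   # hypothetical file name

# Reading a file that does not exist raises FileNotFoundError.
try:
    open(path, 'r')
    missing = False
except FileNotFoundError as err:
    missing = True
    print('could not open:', err.filename)

# Opening for writing creates the file, but reading from that handle fails.
with open(path, 'w') as handle:
    try:
        handle.read()
        unreadable = False
    except io.UnsupportedOperation as err:
        unreadable = True
        print('cannot read:', err)

# FileNotFoundError is a subclass of OSError, so a broader handler catches it too.
print(issubclass(FileNotFoundError, OSError))
```

Keep in mind that opening an existing file with `'w'` empties it, so mixing up the two flags can lose data as well as raise an error.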