Let's start from what we have seen in the previous notebook...

# Let's read some wind data

In [None]:
# first, the imports
import os
import datetime as dt

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(19760812)
%matplotlib inline

In [None]:
# we read data from file 'mast.txt'
ipath = os.path.join('Datos', 'mast.txt')

# Now, we define a function to parse the dates
def dateparse(date, time):
    YY = 2000 + int(date[:2])
    MM = int(date[2:4])
    DD = int(date[4:])
    hh = int(time[:2])
    mm = int(time[2:])
    
    return dt.datetime(YY, MM, DD, hh, mm, 0)
    

cols = ['Date', 'time', 'wspd', 'wspd_max', 'wdir',
        'x1', 'x2', 'x3', 'x4', 'x5', 
        'wspd_std']
wind = pd.read_csv(ipath, sep = r"\s+", names = cols, 
                   parse_dates = [[0, 1]], index_col = 0,
                   date_parser = dateparse)
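
To check `dateparse` on its own, here is a minimal sketch. It assumes the file stores the date as `YYMMDD` and the time as `HHMM`, which is what the slicing in the function implies; the sample strings are made up for illustration and are not taken from 'mast.txt'.

```python
import datetime as dt

def dateparse(date, time):
    # date is assumed to look like 'YYMMDD' and time like 'HHMM'
    YY = 2000 + int(date[:2])
    MM = int(date[2:4])
    DD = int(date[4:])
    hh = int(time[:2])
    mm = int(time[2:])
    return dt.datetime(YY, MM, DD, hh, mm, 0)

# hypothetical sample values
stamp = dateparse('130904', '1230')
print(stamp)
```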

# Basic information (in this case from a `DataFrame`)

In [None]:
wind.info()

In [None]:
wind.describe()

Access the index, the values and the columns (a `Series` doesn't have the `columns` attribute):

In [None]:
wind.index

In [None]:
wind.values

In [None]:
wind.values.shape

In [None]:
wind.columns

# Removing/extracting columns

There are some columns that are not interesting (named `'x1'`, `'x2'`, `'x3'`, `'x4'` and `'x5'`). We will remove these columns from our `DataFrame`. We have several options to do so:

In [None]:
# We remove a column using the 'del' keyword
del wind['x1']
wind.head(3)

In [None]:
# We extract a column using the 'pop' method
s = wind.pop('x2')
wind.head(3)

In [None]:
del wind['x3']
del wind['x4']
del wind['x5']
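
Another option, not used above, is the `drop` method, which removes several columns at once and returns a new `DataFrame` instead of modifying the original. A minimal sketch on a toy `DataFrame` (the values are made up and only stand in for `wind`):

```python
import pandas as pd

# toy DataFrame with made-up values, standing in for `wind`
toy = pd.DataFrame({'wspd': [5.1, 6.3], 'x1': [0, 0], 'x2': [1, 1]})

# drop returns a new DataFrame and leaves `toy` untouched
trimmed = toy.drop(columns=['x1', 'x2'])
print(trimmed.columns.tolist())
print(toy.columns.tolist())
```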

In [None]:
wind.info()

One of the columns, the one extracted using the `pop` method, is now referenced by the `s` variable, and it is a `Series`:

In [None]:
type(s)

In [None]:
s.head(3)

In [None]:
s.info()

Is it a time series? That is, are all the index values dates?

In [None]:
# s.is_time_series is deprecated
s.index.is_all_dates

In [None]:
s.describe()

In [None]:
s.dtype

In [None]:
s.values

In [None]:
s.index

In [None]:
# This raises an AttributeError: a Series has no 'columns' attribute
s.columns
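
If we want to check for the attribute without triggering the error, `hasattr` does the job. A small sketch on a toy `Series` (made-up values, standing in for `s`):

```python
import pandas as pd

# toy Series with made-up values, standing in for `s`
toy_series = pd.Series([1.0, 2.0], name='x2')
# the same data as a one-column DataFrame
toy_frame = toy_series.to_frame()

print(hasattr(toy_series, 'columns'))
print(hasattr(toy_frame, 'columns'))
```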

# Working with the indexes

In [None]:
# We create a DataFrame
df = pd.DataFrame(np.array([['a','b','c','d','e'], [10,20,30,40,50]]).T,
                  columns = ['col1', 'col2'])
df

We can re-write the indexes at any time:

In [None]:
df.index = np.arange(1,6) * 100
df

We can use a column to define our indexes:

In [None]:
df.set_index('col1', inplace = True)
df

We can undo the `set_index` action using:

In [None]:
df.reset_index(inplace = True)
df
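
Both `set_index` and `reset_index` can also be used without `inplace = True`, in which case they return a new `DataFrame` and leave the original untouched. A minimal sketch with a toy `DataFrame` (the names `toy`, `indexed` and `restored` are only for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': [10, 20, 30]})

# without inplace, a new DataFrame is returned
indexed = toy.set_index('col1')
print(indexed.index.name)

# reset_index moves the index back into a regular column
restored = indexed.reset_index()
print(restored.columns.tolist())

# the original still has both columns
print(toy.columns.tolist())
```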

As with indexes, we can change the name of the columns:

In [None]:
df.columns = ['column1', 'column2']
df

The index itself can have a name (as we have already seen):

In [None]:
df.index.name = 'indices'
df

# `pandas` data structures are numpy arrays on steroids

Don't forget that behind the scenes we have numpy arrays, and `pandas` exposes much of the numpy array functionality directly from its data structures.

We can see, for instance, which attributes of a numpy array have an equivalent directly in a `Series` (or `DataFrame`):

In [None]:
numpy_attrs = dir(s.values)
series_attrs = dir(s)
for attr in numpy_attrs:
    if attr not in series_attrs:
        print('NOOOOOOOOOOOOOOOOOOOOOO', attr)
    else:
        print(attr)

So, a lot of the operations we do with a numpy array can be done directly on a `pandas` data structure:

In [None]:
s.mean()

In [None]:
s.min()

In [None]:
s.max()

In [None]:
s[0:10].tolist()

...

<div class="alert alert-danger">
<p><b>Note:</b></p>
<p>When performance is an issue, it can sometimes be convenient to call the numpy array methods directly.</p>
</div>

In [None]:
%%timeit 
s.mean()
s.min()
s.max()

In [None]:
%%timeit 
s.values.mean()
s.values.min()
s.values.max()
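
Outside a notebook the `%%timeit` magic is not available, but the same comparison can be sketched with the standard `timeit` module. The toy `Series` below is made up, and the timings depend on the machine and the library versions, so no numbers are shown:

```python
import timeit

import numpy as np
import pandas as pd

# toy Series, standing in for `s`
toy = pd.Series(np.arange(10_000, dtype=float))

# time 200 calls of each variant
t_pandas = timeit.timeit(lambda: toy.mean(), number=200)
t_numpy = timeit.timeit(lambda: toy.values.mean(), number=200)

print('pandas: {:.6f} s'.format(t_pandas))
print('numpy:  {:.6f} s'.format(t_numpy))
```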

# And where are the steroids?

Be patient!!!!!!

## 'Stuff' that is in a `Series` but not in a numpy array

In [None]:
numpy_attrs = dir(s.values)
series_attrs = dir(s)
for attr in series_attrs:
    if attr not in numpy_attrs:
        print('NOOOOOOOOOOOOOOOOOOOOOO', attr)
    else:
        print(attr)

## 'Stuff' that is in a `DataFrame` but not in a numpy array

In [None]:
numpy_attrs = dir(s.values)
dataframe_attrs = dir(wind)
for attr in dataframe_attrs:
    if attr not in numpy_attrs:
        print('NOOOOOOOOOOOOOOOOOOOOOO', attr)
    else:
        print(attr)
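
The loops above print every attribute. If we only want the differences, a set operation gives a shorter listing. A minimal sketch on a toy `Series` (the same idea works for a `DataFrame`):

```python
import numpy as np
import pandas as pd

# toy Series, standing in for `s`
toy = pd.Series(np.arange(5, dtype=float))

series_attrs = set(dir(toy))
numpy_attrs = set(dir(toy.values))

# attributes that exist in the Series but not in the numpy array
only_in_series = sorted(series_attrs - numpy_attrs)
# attributes available in both
shared = sorted(series_attrs & numpy_attrs)

print(len(only_in_series), 'attributes only in the Series')
print(only_in_series[:10])
```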

## Examples of some useful operations.

We will see some of these in more detail, with examples, in the next notebooks.

In [None]:
wind['wspd'].apply(lambda x: str(x) + ' m/s')

In [None]:
wind.corr()

In [None]:
wind.cumsum()

In [None]:
wind.diff()
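
On a large `DataFrame` it is hard to verify by eye what `cumsum` and `diff` return, so here is a tiny sketch with made-up values where the result can be checked by hand:

```python
import pandas as pd

toy = pd.Series([1.0, 3.0, 6.0])

# running total: 1, 1 + 3, 1 + 3 + 6
running = toy.cumsum()
print(running.tolist())

# difference with the previous element; the first one has no
# previous element, so it is NaN
steps = toy.diff()
print(steps.tolist())
```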

For now we are only skimming these methods. We will cover them in more detail later.

Let's try some simple exercises:

In [None]:
# Calculate the mean wind speed (column 'wspd'):


In [None]:
# Calculate the median of the wind direction (column 'wdir'):


In [None]:
# Obtain the maximum difference between two time steps
# (column 'wspd_std')


Other interesting methods are the `pd.rolling_*`:

In [None]:
pd.rolling_mean(wind, 5, center = True).head(10)

As the warning message above says, the `rolling_*` functions are deprecated and will be removed in a future version. In the previous text cell I explicitly wrote 'methods' because all the `rolling_*` functions are now grouped under the `rolling` method. Here is how to do the same with `rolling`:

In [None]:
wind.rolling(5, center = True).mean().head(10)
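
To see what `center = True` changes, here is a small sketch with a made-up `Series`. With a window of 5, a centered window needs two values on each side, so the first two and the last two results are `NaN`; without centering, the first four are `NaN`.

```python
import pandas as pd

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# the window is centered on each element
centered = toy.rolling(5, center=True).mean()
print(centered.tolist())

# the window ends at each element (the default)
trailing = toy.rolling(5).mean()
print(trailing.tolist())
```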

Other interesting 'stuff' in a `DataFrame` (you can replace the `DataFrame` with a `Series` or another data structure):

In [None]:
import inspect
info = inspect.getmembers(wind, predicate=inspect.ismethod)

for stuff in info:
    print(stuff[0])

# Working with missing data

It is quite usual that our datasets have missing data.

In [None]:
index = pd.date_range('2000/01/01', freq = '12H', periods = 10)
index = index.append(pd.date_range('2000/01/10', freq = '1D', periods = 3))
df = pd.DataFrame(np.random.randint(1, 100, size = (13, 3)), 
                  index = index, columns = ['col1', 'col2', 'col3'])
df

In [None]:
# Let's fill some values with NaN
df[df > 70] = np.nan
df

As opposed to what happens with a numpy array, in `pandas`, operations ignore `NaN` values unless we explicitly state the opposite. Let's see this in action:

In [None]:
df['col1'].sum()

In [None]:
df['col1'].values.sum()

In [None]:
df['col1'].sum(skipna = False)
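
The same comparison on a tiny made-up `Series`, where the numbers are easy to check by hand. `np.nansum` is the numpy function that skips `NaN` values, similar to what `pandas` does by default:

```python
import numpy as np
import pandas as pd

toy = pd.Series([1.0, np.nan, 3.0])

# pandas skips NaN by default
pandas_sum = toy.sum()
# a plain numpy sum propagates NaN
numpy_sum = toy.values.sum()
# numpy needs a dedicated function to skip NaN
numpy_nansum = np.nansum(toy.values)
# pandas propagates NaN only when asked to
strict_sum = toy.sum(skipna=False)

print(pandas_sum, numpy_sum, numpy_nansum, strict_sum)
```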

We can detect 'null' values (`NaN`) using `isnull`:

In [None]:
df.isnull()

Or not null using `notnull`:

In [None]:
df.notnull()

We can see that we have `NaN` values. We can fill them using `ffill` or `bfill` (equivalent to `fillna(method = 'ffill')` and `fillna(method = 'bfill')`, respectively):

In [None]:
# Let's recall what our DataFrame looks like
df

In [None]:
df.ffill()

In [None]:
df.bfill()

In [None]:
df.fillna(value = 'Kiko')
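
A small sketch with a made-up `Series` shows the direction of each fill. Note that `ffill` cannot fill a leading `NaN` and `bfill` cannot fill a trailing one, because there is no value to propagate from:

```python
import numpy as np
import pandas as pd

toy = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# propagate the last valid value forward
forward = toy.ffill()
print(forward.tolist())

# propagate the next valid value backward
backward = toy.bfill()
print(backward.tolist())
```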

Let's create a new `DataFrame` with indexes with 12H frequency.

In [None]:
df = pd.DataFrame(np.random.randint(1, 100, size = (15, 3)), 
                  index = pd.date_range('2015/01/01', freq = '12H', periods = 15))
df

In [None]:
df[df > 70] = 'Kiko'
df

We can remove the rows or columns that contain any `NaN` value, or only those where all the values are `NaN`:

In [None]:
df[df == 'Kiko'] = np.nan
df

In [None]:
# We remove the rows where any value of the row is NaN
# axis = 0 would be equivalent to axis = 'rows' or axis = 'index'
# Later we will see more about the axis keyword...
df.dropna(axis = 'rows') 

In [None]:
# Let's remove the rows where all the values in the row are NaN
df.iloc[2, :] = np.nan
df.dropna(axis = 'rows', how = 'all')

In [None]:
# We can remove the columns where any value in the column is NaN
# axis = 1 is equivalent to axis = 'columns'. More on this later.
# how = 'any' is the default value, so we don't need to add it.
df.dropna(axis = 'columns', how = 'any')

In [None]:
# Let's add a column only with not null values and let's repeat the operation
df['col4'] = 9999
df.dropna(axis = 'columns', how = 'any')

In [None]:
# Now let's add a column where all the values are NaN
df['col5'] = np.nan
df.dropna(axis = 'columns', how = 'all')
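
The options above can be compared side by side on a tiny made-up `DataFrame`. The sketch also uses the `thresh` parameter, not shown above, which keeps the rows that have at least that many non-`NaN` values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                    'b': [2.0, 5.0, np.nan],
                    'c': [3.0, 6.0, np.nan]})

# rows with no NaN at all (how = 'any' is the default)
any_rows = toy.dropna()
# rows that are not entirely NaN
all_rows = toy.dropna(how='all')
# columns with no NaN at all; here every column has one
any_cols = toy.dropna(axis='columns')
# rows with at least two valid values
thresh_rows = toy.dropna(thresh=2)

print(any_rows.shape, all_rows.shape, any_cols.shape, thresh_rows.shape)
```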

We can also fill `NaN` values using `interpolate`:

In [None]:
df.interpolate()

But what is happening here? Why are the null values not being interpolated?

Let's have a look at the column types.

In [None]:
df.info()

We can see that columns `0`, `1` and `2` are of type `object`, which is not a numeric type. Column `col4`, on the other hand, has no values to interpolate, and in column `col5` all the values are `NaN`. Let's convert the first three columns to floats so that they can be interpolated:

In [None]:
df[[0, 1, 2]] = df[[0, 1, 2]].astype(float)

In [None]:
df.interpolate()
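
By default `interpolate` is linear and treats the values as equally spaced, ignoring the index. With a date index that is not evenly spaced, `method = 'time'` takes the actual time gaps into account. A minimal sketch with made-up values, which also repeats the `object` to `float` conversion on a toy `Series`:

```python
import numpy as np
import pandas as pd

# irregular dates: the gap after the NaN is twice the gap before it
idx = pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-04'])
toy = pd.Series([0.0, np.nan, 3.0], index=idx)

# default: halfway between the neighbours
linear = toy.interpolate()
# weighted by the time elapsed between the neighbours
by_time = toy.interpolate(method='time')
print(linear.iloc[1], by_time.iloc[1])

# an object column has to be converted before interpolating
as_object = pd.Series([1, np.nan, 3], dtype=object)
as_float = as_object.astype(float).interpolate()
print(as_float.tolist())
```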

In [None]:
# Have a look at the docs of the 'interpolate' method to learn how to use it
