# Python for Data Science Teaching Session 1: Data Manipulation

## Introduction

### Course Prerequisites

It is advised that course participants have completed WDSS's [Beginner's Python](http://education.wdss.io/beginner-python) course or equivalent, including most of the additional notes on Pythonic programming. You should be able to get by if this is not the case, but you may want to brush up on the following notes:

- [Lists](https://education.wdss.io/beginners-python/session-four/) and [dictionaries](https://education.wdss.io/beginners-python/session-six/)
- [List comprehension](https://colab.research.google.com/github/warwickdatasciencesociety/beginners-python/blob/master/session-four/session_four_additional_content.ipynb) and [dictionary comprehension](https://colab.research.google.com/github/warwickdatasciencesociety/beginners-python/blob/master/session-four/session_six_additional_content.ipynb)
- [Truthiness and if-expressions](https://colab.research.google.com/github/warwickdatasciencesociety/beginners-python/blob/master/session-three/session_three_additional_content.ipynb)
- [String methods](https://colab.research.google.com/github/warwickdatasciencesociety/beginners-python/blob/master/session-three/session_two_additional_content.ipynb)
- [Importing modules and packages](https://education.wdss.io/beginners-python/session-eight/)

### Session Objectives

- Reading/writing data from/to files
- Exploring the structure and contents of a dataset
- Subsetting and filtering
- Mutating and summarising datasets

### Recommendations and Advice

Check out [PEP 8](https://www.python.org/dev/peps/pep-0008/) and use a [linter](https://jupyterlab-code-formatter.readthedocs.io/en/latest/index.html) if needed.

Google, Google, Google. Use [Stack Overflow](https://stackoverflow.com/) and the [pandas reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) to find the answer you're after.

A warning: data-scientific Python is the Wild West. There are often many ways to achieve the same thing. Although this provides flexibility, it can cause confusion when learning. Don't be put off if it's not clear when and why to use one method over another. There might be no reason beyond personal preference.

## Getting Started with pandas

### What is pandas?

Let's ask the team:

> pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

In short, pandas allows you to:
- Read/write data using a wide variety of formats
- Manipulate and transform data
- Combine data sources together (session 4)
- Perform basic analysis and visualisation

It is part of the [SciPy stack](https://www.scipy.org/stackspec.html), a collection of Python packages for scientific programming (closely related to data science).

Once installed (see [bonus session one](https://education.wdss.io/python-for-data-science/bonus-one)), you can import it (using its usual alias, `pd`).

In [4]:
# Import pandas
import pandas as pd

### Importing CSVs

In this session, we'll be looking at the [wine quality](https://archive.ics.uci.edu/ml/datasets/wine+quality) dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php). You can download this directly from the site (we'll only be using the data for red wine), or find it on this session's [materials](https://education.wdss.io/python-for-data-science/session-one) on the course website.

CSV stands for comma-separated values. CSV files are plain-text files used to store tabular data, with one observation per line and values separated by commas. E.g.

```
"Numeric Column", "Boolean Column", "Text Column"
4, True, "Cat"
7, False, "Dog"
6, True, "Elephant"
```

CSV files can also be separated by semicolons. This is common in Europe, where a comma is used as the decimal separator.

We read CSV files using the `read_csv` function from pandas. When the separating character is not a comma, we have to specify it using the `sep` parameter.

In [6]:
wine = pd.read_csv('winequality-red.csv', sep=';')

The `read_csv` function has a ridiculous number of possible parameters. Read the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to learn more.
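
To show the `sep` parameter in isolation, here is a minimal sketch that reads semicolon-separated text from memory rather than from a file. The data, the `raw` string and the `prices` name are made up for illustration. The `decimal` parameter is one of the many extra `read_csv` options, used here on the assumption that the file uses decimal commas.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data with decimal commas (not the wine file)
raw = "name;price\nMerlot;7,50\nRioja;9,25\n"

# sep sets the field delimiter; decimal tells pandas to read 7,50 as 7.5
prices = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")

print(prices)
print(prices.dtypes)
```

Without `sep=";"`, pandas would split each line on commas instead, so the columns would come out wrong.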

### Viewing a Dataframe's Structure

We can view the whole dataframe by typing it in a code cell.

In [None]:
# Print entire dataframe
wine

There are various attributes of a pandas dataframe.

In [None]:
# Dimensions
wine.shape

In [None]:
# Number of columns
wine.shape[1]

In [None]:
# Column names
wine.columns

We're not going to worry about what an `Index` is in this course. It often works the same as a list and can be converted to one if needed.

In [None]:
# Column names as list
wine.columns.values.tolist()

In fact, there are many ways to do this (search [Stack Overflow](https://stackoverflow.com/questions/19482970/get-list-from-pandas-dataframe-column-headers) to find out), as is the case for many tasks involving the SciPy stack, but this one is reported to be the fastest.

In [None]:
# Row names (indexes)
wine.index

In [None]:
# Column types
wine.dtypes

### Exploring Dataframe Contents

We can obtain simple or more substantial summaries of the dataframe using a variety of methods.

In [10]:
# Top 5 rows
wine.iloc[0:5,:]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [12]:
# Bottom 3 rows
wine.tail(3)

In [None]:
# Random sample of 4 rows
wine.sample(4)

As with `read_csv`, the `sample` method has many optional arguments for replacement, weights, and random state. We will only ever go through the most critical parameters in this course, so it is your job to read the documentation when you want to go further.
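
As a small illustration of two of those optional arguments, the sketch below uses a made-up dataframe (`toy` is hypothetical, not the wine data). It assumes you want a reproducible sample, which is what `random_state` is for, and then draws more rows than the dataframe contains using `replace=True`.

```python
import pandas as pd

# Hypothetical ten-row dataframe used only for this example
toy = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# The same random_state gives the same sample every time
first = toy.sample(n=4, random_state=42)
second = toy.sample(n=4, random_state=42)
print(first.equals(second))

# Sampling with replacement allows more rows than the dataframe contains
oversample = toy.sample(n=15, replace=True, random_state=0)
print(len(oversample))
```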

In [14]:
# First 2 rows of random sample of 3 columns
wine.sample(3, axis=1).head(2)

Unnamed: 0,fixed acidity,pH,residual sugar
0,7.4,3.51,1.9
1,7.8,3.20,2.6


In [None]:
# Numerical summaries of columns
wine.describe()

Also see `.info()` and `.count()` for similar functionality.

## Subsetting and Filtering

### Subsetting Rows and Columns

Pandas has two main indexers for subsetting a dataframe:

- `.loc`: label-based
- `.iloc`: integer-based (using zero-based indexing)

These both accept single values, lists/arrays and slices (and a few more—[read the docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)!)

For dataframes, we follow `.loc` and `.iloc` with a pair of square brackets containing either one or two inputs. If one input is used, it subsets the rows. If two are used, they subset the rows and columns respectively.

A colon (`:`) can be used to include everything along that dimension (all rows or all columns).
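
The rules above can be sketched on a small made-up dataframe. The `toy` dataframe and its row labels `r1` to `r3` are hypothetical and chosen so that labels and integer positions are easy to tell apart.

```python
import pandas as pd

# Hypothetical dataframe with string row labels
toy = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
    index=["r1", "r2", "r3"],
)

row_by_label = toy.loc["r2"]     # one input: selects a row by its label
row_by_position = toy.iloc[1]    # the same row, selected by integer position
value = toy.loc["r2", "b"]       # two inputs: row label, then column label
column = toy.iloc[:, 2]          # colon keeps all rows; 2 picks the third column

print(row_by_label)
print(value)
print(column)
```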

In [None]:
# 10th row of the dataset
wine.iloc[9]

In [17]:
# 2nd to last column
#iloc is integer based approach
wine.iloc[:,-2]

0        9.4
1        9.8
2        9.8
3        9.8
4        9.4
        ... 
1594    10.5
1595    11.2
1596    11.0
1597    10.2
1598    11.0
Name: alcohol, Length: 1599, dtype: float64

In [13]:
# 4th row, 7th column
wine.iloc[3,6]

60.0

In [16]:
# Column standard deviations (row 2 of describe(); row 1 holds the means)
wine.describe().iloc[2]

fixed acidity            1.741096
volatile acidity         0.179060
citric acid              0.194801
residual sugar           1.409928
chlorides                0.047065
free sulfur dioxide     10.460157
total sulfur dioxide    32.895324
density                  0.001887
pH                       0.154386
sulphates                0.169507
alcohol                  1.065668
quality                  0.807569
Name: std, dtype: float64

In [19]:
# Acidity markers for every 10th row
#::10 - every 10th row, [] - array of labels
wine.iloc[::10].loc[:, ['fixed acidity', 'volatile acidity', 'citric acid', 'pH']]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,pH
0,7.4,0.700,0.00,3.51
10,6.7,0.580,0.08,3.28
20,8.9,0.220,0.48,3.39
30,6.7,0.675,0.07,3.35
40,7.3,0.450,0.36,3.33
...,...,...,...,...
1550,7.1,0.680,0.00,3.45
1560,7.8,0.600,0.26,3.21
1570,6.4,0.360,0.53,3.37
1580,7.4,0.350,0.33,3.36


Notice that when using a list to subset columns we obtain a dataframe in return. This holds even if the list has one element.
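
A quick way to see the difference is to compare the types and shapes returned. This is a minimal sketch on a made-up dataframe (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical dataframe used only for this example
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

as_series = toy.loc[:, "a"]     # single label: one-dimensional result
as_frame = toy.loc[:, ["a"]]    # one-element list: still two-dimensional

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```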

In [21]:
# 6th row as dataframe
wine.iloc[[5]]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5


In [23]:
# (4, 2) element as dataframe
wine.iloc[[4], [2]]

Unnamed: 0,citric acid
4,0.0


Be careful: unlike integer slices, label slices (`.loc`) include the final value.

In [24]:
wine.loc[:, 'density':'quality']

Unnamed: 0,density,pH,sulphates,alcohol,quality
0,0.99780,3.51,0.56,9.4,5
1,0.99680,3.20,0.68,9.8,5
2,0.99700,3.26,0.65,9.8,5
3,0.99800,3.16,0.58,9.8,6
4,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...
1594,0.99490,3.45,0.58,10.5,5
1595,0.99512,3.52,0.76,11.2,6
1596,0.99574,3.42,0.75,11.0,6
1597,0.99547,3.57,0.71,10.2,5
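
The contrast between the two kinds of slice is easiest to see side by side. This sketch uses a made-up four-column dataframe (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical dataframe with columns a, b, c, d
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

by_label = toy.loc[:, "a":"c"]    # label slice: the end label "c" is included
by_position = toy.iloc[:, 0:2]    # integer slice: position 2 is excluded

print(list(by_label.columns))
print(list(by_position.columns))
```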


We can also use these methods for setting values.

In [35]:
df = pd.DataFrame({
    'x': [1, 2, 3],
    'y': [4, 5, 6]
})
df

Unnamed: 0,x,y
0,1,4
1,2,5
2,3,6


In [36]:
# Change 1 to -1
df.iloc[0,0]= -1
df

Unnamed: 0,x,y
0,-1,4
1,2,5
2,3,6


In [37]:
# Double second row
df.iloc[1] *=2
df

Unnamed: 0,x,y
0,-1,4
1,4,10
2,3,6


We can also extract columns using regular square brackets (just like a list or dictionary) using label-based indexing.

In [43]:
# pH column
wine['pH']

0       3.51
1       3.20
2       3.26
3       3.16
4       3.51
        ... 
1594    3.45
1595    3.52
1596    3.42
1597    3.57
1598    3.39
Name: pH, Length: 1599, dtype: float64

### Series

Unless we force a dataframe to be returned using one-element lists, pandas will return either a single value, a series or a new dataframe depending on whether our result is 0, 1, or 2-dimensional.

In [44]:
type(wine['pH'])
# in this case, 1 dimensional - series
# if single value, it is 0 dimensional - single value


pandas.core.series.Series

A series is a one-dimensional array with axis labels. We can use `.loc` and `.iloc` on series but only need to specify a single input. We can also use standard square brackets using either labels or integers.

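It is important to note that subsetting in pandas may return a view onto the original data rather than an independent copy, so modifying the subset can modify the original dataframe (unless you use the `.copy()` method). The exact behaviour depends on the pandas version and settings: with Copy-on-Write enabled, a subset always behaves like a copy.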

In [45]:
df = pd.DataFrame({
    'x': [1, 2, 3],
    'y': [4, 5, 6]
})

x = df['x']
y = df['y'].copy()

x[1] = 0
y[1] = 0
# df['x'] changes because x refers to df's data; df['y'] does not because y is a copy
df

Unnamed: 0,x,y
0,1,4
1,0,5
2,3,6
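
Because the view-versus-copy behaviour can vary, a pattern that should behave the same everywhere is to copy explicitly when you want independence and to assign through the dataframe itself when you want to change it. A minimal sketch, reusing the small `df` from above:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# An explicit copy is independent of df
y = df["y"].copy()
y.iloc[1] = 0

# To change the original on purpose, assign through the dataframe with .loc
df.loc[1, "x"] = 0

print(df)
print(y)
```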


> See also: `.at` and `.iat` in [the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html)

### Filtering

`.loc`, `.iloc` and `[]` can also accept Boolean vectors, returning only the rows/columns that correspond to a true value.

In [61]:
# Select rows with pH greater than 2.9 and volatile acidity less than 9.2
wine.loc[(wine['pH']>2.9)&(wine['volatile acidity']<9.2)]


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


In [58]:
# Select only decimal columns
wine.loc[:,wine.dtypes=='float64']

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4
...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2


A useful helper is the `isin()` series method.

In [65]:
# Select wines of quality 3, 5, 6
wine.loc[wine['quality'].isin([3,5,6])]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


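You can combine Boolean vectors using Boolean operators. The notation in pandas differs from base Python, however, and each comparison should be wrapped in parentheses because these operators bind more tightly than comparisons such as `>`: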
- Use `&` for `and`
- Use `|` for `or`
- Use `~` for `not`
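
The three operators can be sketched on a made-up dataframe as follows (`toy` and its values are hypothetical). Note the parentheses around each comparison.

```python
import pandas as pd

# Hypothetical data used only for this example
toy = pd.DataFrame({"ph": [2.8, 3.2, 3.6], "quality": [5, 6, 7]})

both = toy.loc[(toy["ph"] > 3) & (toy["quality"] >= 6)]     # and
either = toy.loc[(toy["ph"] < 3) | (toy["quality"] == 7)]   # or
negated = toy.loc[~(toy["ph"] > 3)]                         # not

print(both)
print(either)
print(negated)
```

Using the keywords `and`, `or` or `not` on a series raises an error, because Python cannot reduce a whole series to a single truth value.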

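We could use this to drop columns with certain names (I'll leave this as a puzzle), but there is a better way using the `.drop` method. Note that `.drop` returns a new dataframe rather than modifying `wine`, so displaying `wine` afterwards still shows every column.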

In [66]:
wine.drop(['citric acid', 'residual sugar'], axis=1)
wine

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5
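
To keep the result of a drop, assign it to a name. The sketch below uses a made-up dataframe (`toy` is hypothetical) and also shows the `columns=` keyword, which is an alternative to passing `axis=1`.

```python
import pandas as pd

# Hypothetical dataframe used only for this example
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

trimmed = toy.drop(["b"], axis=1)        # returns a new dataframe
also_trimmed = toy.drop(columns=["b"])   # same result via the columns keyword

print(list(toy.columns))       # the original is unchanged
print(list(trimmed.columns))
```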


## Data Manipulation

### Sorting

We can sort a pandas dataframe using the `sort_values` method. This sorts either columns or rows depending on the specified axis. If a single label is provided, the dataframe is sorted using that row/column. If a list is provided, the later labels are used to break ties.

In [67]:
# Sort first by quality then by alcohol
wine.sort_values(['quality','alcohol'])
#default is ascending

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
517,10.4,0.610,0.49,2.1,0.200,5.0,16.0,0.99940,3.16,0.63,8.4,3
459,11.6,0.580,0.66,2.2,0.074,10.0,47.0,1.00080,3.25,0.57,9.0,3
1469,7.3,0.980,0.05,2.1,0.061,20.0,49.0,0.99705,3.31,0.55,9.7,3
1374,6.8,0.815,0.00,1.2,0.267,16.0,29.0,0.99471,3.32,0.51,9.8,3
832,10.4,0.440,0.42,1.5,0.145,34.0,48.0,0.99832,3.38,0.86,9.9,3
...,...,...,...,...,...,...,...,...,...,...,...,...
390,5.6,0.850,0.05,1.4,0.045,12.0,88.0,0.99240,3.56,0.82,12.9,8
1120,7.9,0.540,0.34,2.5,0.076,8.0,17.0,0.99235,3.20,0.72,13.1,8
455,11.3,0.620,0.67,5.2,0.086,6.0,19.0,0.99880,3.22,0.69,13.4,8
588,5.0,0.420,0.24,2.0,0.060,19.0,50.0,0.99170,3.72,0.74,14.0,8


In [68]:
# Sort by descending quality, then by descending alcohol
wine.sort_values(['quality','alcohol'],ascending=False)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
588,5.0,0.420,0.24,2.0,0.060,19.0,50.0,0.99170,3.72,0.74,14.0,8
1269,5.5,0.490,0.03,1.8,0.044,28.0,87.0,0.99080,3.50,0.82,14.0,8
455,11.3,0.620,0.67,5.2,0.086,6.0,19.0,0.99880,3.22,0.69,13.4,8
1120,7.9,0.540,0.34,2.5,0.076,8.0,17.0,0.99235,3.20,0.72,13.1,8
390,5.6,0.850,0.05,1.4,0.045,12.0,88.0,0.99240,3.56,0.82,12.9,8
...,...,...,...,...,...,...,...,...,...,...,...,...
832,10.4,0.440,0.42,1.5,0.145,34.0,48.0,0.99832,3.38,0.86,9.9,3
1374,6.8,0.815,0.00,1.2,0.267,16.0,29.0,0.99471,3.32,0.51,9.8,3
1469,7.3,0.980,0.05,2.1,0.061,20.0,49.0,0.99705,3.31,0.55,9.7,3
459,11.6,0.580,0.66,2.2,0.074,10.0,47.0,1.00080,3.25,0.57,9.0,3


> See the `key` parameter in [the docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) for more flexible sorts
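
As a sketch of two more flexible sorts, the example below uses made-up data (`toy` and `names` are hypothetical). It passes a list to `ascending` to sort each column in a different direction, then uses `key` to sort text without regard to case. The `key` function receives a whole series and is assumed to return a series of the same length.

```python
import pandas as pd

# Hypothetical data used only for this example
toy = pd.DataFrame({
    "quality": [5, 6, 5, 6],
    "alcohol": [9.4, 10.1, 11.0, 9.8],
})

# Quality descending, ties broken by alcohol ascending
ordered = toy.sort_values(["quality", "alcohol"], ascending=[False, True])
print(ordered)

names = pd.DataFrame({"grape": ["merlot", "Syrah", "gamay"]})

# key is applied to the column before sorting; here it lower-cases the text
case_insensitive = names.sort_values("grape", key=lambda s: s.str.lower())
print(case_insensitive)
```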

> Most of the methods we've used come with an `inplace` parameter, which when set to `True` will perform the operation directly on the data rather than returning a modified dataframe. This is useful in some cases but prevents you from chaining together methods.

### Creating and Overwriting Columns

We can create new columns using square brackets, providing a column name that doesn't already exist. If the column does exist, its values will be overwritten.

Operations are _vectorised_, meaning they act on an element-by-element basis.
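
To make "element-by-element" concrete, this sketch compares a vectorised subtraction with the equivalent explicit loop on a made-up dataframe (`toy` is hypothetical). The loop is there only for comparison, and the vectorised form is the one you would normally write.

```python
import pandas as pd

# Hypothetical data used only for this example
toy = pd.DataFrame({"total": [34.0, 67.0], "free": [11.0, 25.0]})

# Vectorised: one expression acts on every row at once
vectorised = toy["total"] - toy["free"]

# Equivalent explicit loop over matching pairs of values
looped = pd.Series([t - f for t, f in zip(toy["total"], toy["free"])])

print(vectorised.equals(looped))
```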

In [72]:
# Calculate non-free sulfur dioxide
#total SO2 - free SO2
wine['non-free sulfur dioxide'] = wine['total sulfur dioxide'] - wine['free sulfur dioxide']
wine.loc[:,'non-free sulfur dioxide']

0       23.0
1       42.0
2       39.0
3       43.0
4       23.0
        ... 
1594    12.0
1595    12.0
1596    11.0
1597    12.0
1598    24.0
Name: non-free sulfur dioxide, Length: 1599, dtype: float64

In [77]:
# Replace density with grams/litre
wine['density'] = wine['density']*1000
wine['density']

0       997.80
1       996.80
2       997.00
3       998.00
4       997.80
         ...  
1594    994.90
1595    995.12
1596    995.74
1597    995.47
1598    995.49
Name: density, Length: 1599, dtype: float64

If a single value is used, it will fill the entire column.

In [79]:
# Add column of zeros
wine['Zeros'] = 0
wine['Zeros']

0       0
1       0
2       0
3       0
4       0
       ..
1594    0
1595    0
1596    0
1597    0
1598    0
Name: Zeros, Length: 1599, dtype: int64

### Creating Summaries

Pandas allows you to create summaries of rows, columns or series. Some common methods for this are `mean`, `min`, `max`, `median`, `mode`, `std`, `var`, `sum`. These are more useful when we have grouped data, which we will introduce in the project session.
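
The sketch below shows how the `axis` argument changes the direction of a summary, using a made-up dataframe (`toy` is hypothetical). By default, dataframe summaries produce one value per column, and `axis=1` gives one value per row instead.

```python
import pandas as pd

# Hypothetical dataframe used only for this example
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 9]})

col_means = toy.mean()       # one value per column
row_max = toy.max(axis=1)    # one value per row
a_total = toy["a"].sum()     # a series summary is a single value

print(col_means)
print(row_max)
print(a_total)
```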

In [83]:
# Average of all columns
wine.mean()

In [None]:
# Maximum value of each row
wine.max(axis=1)

0       997.80
1       996.80
2       997.00
3       998.00
4       997.80
         ...  
1594    994.90
1595    995.12
1596    995.74
1597    995.47
1598    995.49
Length: 1599, dtype: float64

In [84]:
# Standard deviation of pH
wine['pH'].std()

0.15438646490354277

Pandas also offers two useful Boolean reduction functions, `all` and `any`, which return `True` if all or any of the values in the series they are applied to are `True`, respectively. They can also be applied to dataframes, in which case they act on each column independently.
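
A minimal sketch of both reductions on a made-up dataframe (`toy` is hypothetical). Applying `all` twice first reduces each column and then reduces those per-column results to a single value.

```python
import pandas as pd

# Hypothetical dataframe used only for this example
toy = pd.DataFrame({"a": [1, 2, 3], "b": [0, -1, 5]})

any_negative = (toy["b"] < 0).any()     # is at least one value of b negative?
per_column = (toy >= 0).all()           # one True/False per column
whole_frame = (toy >= 0).all().all()    # a single True/False for the dataframe

print(any_negative)
print(per_column)
print(whole_frame)
```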

In [87]:
# Are any pH values less than 3?
(wine['pH']<3).any()

True

In [None]:
# Are all values in the dataset non-negative?
(wine >= 0).all().all()

Recall from Beginner's Python that `True`/`False` convert to 1/0 when cast as integers. We can use this to count true values (with `sum`) and to obtain the proportion of true values (with `mean`).

In [None]:
# How many 5-quality wines are there?
(wine['quality'] == 5).sum()

In [None]:
# What proportion of wines are 5-quality?
(wine['quality'] == 5).mean()

## Wrapping Up

### Writing to CSVs

We can write a dataframe to a CSV using the `to_csv` method, passing in a file path. There are many parameters found in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html), but the most commonly used is `index=False` to avoid saving row numbers (which can make it harder for other programs to read).

In [88]:
wine.sample(5).to_csv('wine_sample.csv', index=False)

Note, this will overwrite any existing file without warning.
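
To see what `index=False` changes without touching the file system, the sketch below writes a made-up dataframe (`toy` is hypothetical) to a string, which is what `to_csv` returns when no path is given, and then reads it back.

```python
import io

import pandas as pd

# Hypothetical dataframe used only for this example
toy = pd.DataFrame({"x": [1, 2], "y": [3, 4]})

with_index = toy.to_csv()                # no path given: returns the CSV text
without_index = toy.to_csv(index=False)

print(with_index)
print(without_index)

# Reading back the version with row numbers produces an extra, unnamed column
reread_with_index = pd.read_csv(io.StringIO(with_index))
reread_without_index = pd.read_csv(io.StringIO(without_index))
print(list(reread_with_index.columns))
```

The extra column appears as `Unnamed: 0` when read back, which is why `index=False` is usually the tidier choice.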

### Other IO tools

Pandas is capable of reading from and writing to a large number of file types. A list of these, with the corresponding documentation, can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).