Is pandas really faster? Most of the time, yes. Many of the more advanced Python libraries, such as NumPy and pandas, use vectorisation to speed up their functions. In vectorisation, instead of running through the values one by one, we perform the same operation on a whole batch of values stored as a vector, and the loop over those values runs in compiled code rather than in the Python interpreter. This is possible because modern CPUs support single instruction, multiple data (SIMD) processing. Traditionally, a processor could only operate on one value at a time. Modern processors can hold several values at once and run the same operation on all of them, provided the program is written to take advantage of it.
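As a minimal sketch of the idea (the array names and sizes are made up for illustration), the following compares an element-by-element Python loop with the equivalent vectorised NumPy expression. Both give the same result; the vectorised version simply hands the whole arrays to NumPy in a single expression.

```python
import numpy as np

# Hypothetical example data: two arrays of random integers
rng = np.random.default_rng(0)
a = rng.integers(0, 1000, size=10_000)
b = rng.integers(0, 1000, size=10_000)

# Element by element: one Python-level addition per pair of values
looped = np.array([x + y for x, y in zip(a, b)])

# Vectorised: one expression applied to the whole arrays at once
vectorised = a + b

# Both approaches produce identical results
print(np.array_equal(looped, vectorised))
```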

Let's use a simple example and compare the values from 2 columns in three different ways:

`df[df.A != df.B]`  pandas using vectorisation

`df.query('A != B')` uses numexpr library

`df[[x != y for x, y in zip(df.A, df.B)]]` for loop (list comprehension)

![alt text](https://raw.githubusercontent.com/kraikisto/CERN_LEP_Z_boson/main/comparison.png)
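To check that the three approaches really do the same thing, here is a small sketch on a hand-made frame (the four rows are hypothetical; the plots use much larger random data). Each expression keeps only the rows where `A` and `B` differ.

```python
import pandas as pd

# Small hypothetical frame, just for checking the results by eye
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 5, 3, 7]})

vectorised = df[df.A != df.B]                           # pandas vectorised comparison
queried = df.query("A != B")                            # expression evaluated by query
list_comp = df[[x != y for x, y in zip(df.A, df.B)]]    # Python-level loop

print(vectorised)
print(vectorised.equals(queried), vectorised.equals(list_comp))
```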

Here, instead of a traditional loop, we use a list comprehension. It is built into Python and behaves like a normal for loop, but it is more compact and usually somewhat faster. We can also compare it to a basic for loop:

![alt text](https://raw.githubusercontent.com/kraikisto/CERN_LEP_Z_boson/main/comparison_normal_loop.png)

Interestingly, if we compare against another vectorised method, this time using NumPy, we see that it is even faster. It seems that comparing the underlying arrays via `.values` is significantly faster than the pandas counterpart.

`df[df.A.values != df.B.values]` 

![alt text](https://raw.githubusercontent.com/kraikisto/CERN_LEP_Z_boson/main/comparison_numpy.png)
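The sketch below shows what `.values` changes, using a small hypothetical frame. The pandas comparison returns a `Series` that carries an index, while the `.values` comparison returns a plain NumPy array. A plausible explanation for the speed difference is that the NumPy version skips the `Series` bookkeeping, but that is an assumption and is not measured here. The pandas documentation recommends `.to_numpy()` over `.values` in new code.

```python
import pandas as pd

# Small hypothetical frame for illustration
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 5, 3, 7]})

pandas_mask = df.A != df.B                # boolean Series, keeps the index
numpy_mask = df.A.values != df.B.values   # plain boolean NumPy array

print(type(pandas_mask).__name__)
print(type(numpy_mask).__name__)

# Both masks select the same rows
print(df[pandas_mask].equals(df[numpy_mask]))
```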

You may notice that all of them are more or less straight lines. As we previously covered, looping over a dataset has time complexity $O(N)$, meaning linear. This also holds for the vectorised counterparts. Vectorisation does not change the time complexity, but it does change the slope of the graph.
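A rough way to see this without a plotting tool is to time both approaches at a few sizes with the standard `timeit` module. This is only a sketch: the helper name `time_kernels` and the chosen sizes are made up for illustration, and the numbers will vary from machine to machine. If the scaling is linear, doubling `N` should roughly double both times, while the ratio between the two columns stays roughly constant.

```python
import timeit

import numpy as np
import pandas as pd


def time_kernels(n, repeats=3):
    """Return the best wall time in seconds (vectorised, list comp) for n rows."""
    df = pd.DataFrame(np.random.choice(1000, (n, 2)), columns=["A", "B"])
    vec = min(timeit.repeat(lambda: df[df.A != df.B], number=1, repeat=repeats))
    loop = min(
        timeit.repeat(
            lambda: df[[x != y for x, y in zip(df.A, df.B)]],
            number=1,
            repeat=repeats,
        )
    )
    return vec, loop


for n in (10_000, 20_000, 40_000):
    vec, loop = time_kernels(n)
    print(f"N={n:>6}  vectorised={vec:.6f}s  list comp={loop:.6f}s")
```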

Vectorisation isn't always better. If the data consists of objects or strings rather than numbers, it is much harder to vectorise. Mixed data types within a column also slow vectorisation down significantly. Here is a demonstration with mixed datasets:

`df.query('A != B')` uses numexpr library

`df[df.PT1 != df.PT2]`  pandas using vectorisation

`df[[x != y for x, y in zip(df.PT1, df.PT2)]]` normal for loop (list comprehension)

![alt text](https://raw.githubusercontent.com/kraikisto/CERN_LEP_Z_boson/main/mixed_comparison.png)
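One way to see why mixed data is a problem is to look at the dtype pandas picks. In the sketch below (the values are hypothetical), a purely numeric column gets a numeric dtype, while a column mixing numbers and strings falls back to the generic `object` dtype, where every element is an ordinary Python object. The comparison still works, but pandas can no longer treat the column as one block of numbers.

```python
import pandas as pd

# Hypothetical columns for illustration
numeric = pd.Series([1, 2, 3])
mixed = pd.Series([1, "two", 3.0])

print(numeric.dtype)
print(mixed.dtype)

# Elementwise comparison still works on object columns
other = pd.Series([1, "two", 4.0])
print((mixed != other).tolist())
```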

Notes: 

- The pandas code here is much easier to read than the for loop.

- Some pandas functions don't use vectorisation. `iterrows()` and `apply()` both lose to the list comprehension. `apply()` is not shown on the graphs because it is so slow that it would be hard to see the differences between the other methods.

- I took the plotting code straight from the internet, so if this gets officially used, citing the source would probably be appropriate. A few things about it could also be changed; for example, the comparison to `query` seems a bit pointless.
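To make the note about `iterrows()` and `apply()` concrete, here is a sketch of what those versions could look like, assuming the note refers to row-wise `apply` (`axis=1`). The frame is hypothetical. Both call Python code once per row, which is why they cannot benefit from vectorisation, yet they produce the same mask as the vectorised comparison.

```python
import pandas as pd

# Small hypothetical frame for illustration
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [1, 5, 3, 7]})

# Row-wise apply: the lambda is called once per row
apply_mask = df.apply(lambda row: row["A"] != row["B"], axis=1)

# iterrows: yields (index, row) pairs, building a Series for every row
iterrows_mask = [row["A"] != row["B"] for _, row in df.iterrows()]

# Vectorised comparison for reference
vector_mask = df.A != df.B

print(apply_mask.tolist() == list(iterrows_mask) == vector_mask.tolist())
```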

Code used for plots: 

```python
# Code for the plot that includes the normal for loop
import numpy as np
import pandas as pd
import perfplot


def normal_loop(df):
    # Drop every row where A equals B, one row at a time
    list1 = list(df.A)
    list2 = list(df.B)
    for i in range(len(list1)):
        if list1[i] == list2[i]:
            df = df.drop(index=i)
    return df

#this uses perfplot, a tool specifically made for performance testing
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.choice(1000, (n, 2)), columns=['A','B']),
    kernels=[
        lambda df: df[df.A != df.B],
        lambda df: df[[x != y for x, y in zip(df.A, df.B)]],
        lambda df: normal_loop(df),
    ],
    labels=['vectorized !=', 'list comp', "normal loop"],
    n_range=[2**k for k in range(0, 20)],
    xlabel='N',
    logy=False,  # boolean, not the string "False" (a non-empty string is truthy)
    logx=False,
)
```