05.10.2022 - Added compare rows
#Try it first like this:
#rapidfuzz is a lot faster than fuzzywuzzy, but I had some problems installing it, #even with Visual C++ 2019 redistributable installed a-pandas-ex-fuzz will try to import this module first
pip install a-pandas-ex-plode-tool
pip install a-pandas-ex-df-to-string
pip install rapidfuzz #https://github.com/maxbachmann/RapidFuzz
pip install --no-deps a-pandas-ex-fuzz
#if rapidfuzz does not work, use:
pip install a-pandas-ex-plode-tool
pip install a-pandas-ex-df-to-string
pip install fuzzywuzzy
pip install --no-deps a-pandas-ex-fuzz
#Or if you want to try to install everything:
pip install a-pandas-ex-fuzz
from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching() #adds three new methods to pd.
import pandas as pd
df = pd.read_csv(
"https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)
df11 = df.Name.s_fuzz_all_values_in_column(
limit=5, merge_with_series=True, partial_full_weighted="weighted"
)
df22 = df.Name.s_fuzz_all_values_in_column(
limit=2, merge_with_series=False, partial_full_weighted="full"
)
df33 = df.Name.s_fuzz_all_values_in_column(
limit=1, merge_with_series=True, partial_full_weighted="partial"
)
df22
0 Braund... 70.833333 477 Cann, ... 63.829787
1 Angle,... 55.445545 518 Astor,... 53.061224
2 Sinkko... 79.069767 747 Honkan... 77.272727
3 Futrel... 77.142857 137 Potter... 52.873563
4 Gilles... 84.615385 722 Saunde... 77.777778
5 Bracke... 77.777778 221 Scanla... 76.470588
6 O'Brie... 65.116279 552 Maisne... 58.536585
7 Goodwi... 68.852459 386 Palsso... 67.857143
8 Rosblo... 62.068966 254 Hockin... 59.52381
9 Nasser... 74.074074 122 Astor,... 58.536585
fuzz_index_1
0 37
1 700
2 216
3 879
4 12
5 468
6 464
7 374
8 774
9 700
Parameters:
df: [pd.Series]
limit: int
How many results do you want to have?
Each result will have 3 columns [string, match, position in column]
(default=5)
partial_full_weighted: str
weighted = fuzz.WRatio
full = fuzz.ratio
partial = fuzz.partial_ratio
(default="weighted")
merge_with_series: str
(default=True)
Returns:
pd.DataFrame
from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching() #adds three new methods to pd.
import pandas as pd
df = pd.read_csv(
"https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)
df111 = df.Name.s_fuzz_from_list(
list_to_compare=["Johannes", "Paulo", "Kevin"],
limit=2,
merge_with_series=True,
partial_full_weighted="partial",
)
df222 = df.Name.s_fuzz_from_list(
list_to_compare=["John", "Johannes", "Paulo", "Kevin"],
limit=3,
merge_with_series=False,
partial_full_weighted="full",
)
df333 = df.Name.s_fuzz_from_list(
list_to_compare=["Maria", "Anna"],
limit=1,
merge_with_series=False,
partial_full_weighted="partial",
)
df333
fuzz_string_0 fuzz_match_0 fuzz_index_0
0 Maria 60.0 0
1 Maria 44.444444 0
2 Anna 75.0 1
3 Maria 40.0 0
4 Maria 40.0 0
.. ... ... ...
886 Maria 40.0 0
887 Maria 80.0 0
888 Maria 60.0 0
889 Maria 40.0 0
890 Maria 60.0 0
[891 rows x 3 columns]
Parameters:
df: [pd.Series]
list_to_compare: list
The strings you want to be compared
limit: int
How many results do you want to have?
Each result will have 3 columns [string, match, position in column]
(default=5)
partial_full_weighted: str
weighted = fuzz.WRatio
full = fuzz.ratio
partial = fuzz.partial_ratio
(default="weighted")
merge_with_series: str
(default=True)
Returns:
pd.DataFrame
from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching() #adds three new methods to pd.
import pandas as pd
df = pd.read_csv(
"https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv"
)
df1 = df.Name.s_fuzz_one_word(
word_to_search="Karolina", partial_full_weighted="weighted"
)
df2 = df.Name.s_fuzz_one_word(word_to_search="Karolina", partial_full_weighted="full")
df3 = df.Name.s_fuzz_one_word(
word_to_search="Karolina", partial_full_weighted="partial"
)
df1
Name fuzz_string_0 \
0 Braund, Mr. Owen Harris Karolina
1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) Karolina
2 Heikkinen, Miss. Laina Karolina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Karolina
4 Allen, Mr. William Henry Karolina
5 Moran, Mr. James Karolina
6 McCarthy, Mr. Timothy J Karolina
7 Palsson, Master. Gosta Leonard Karolina
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) Karolina
9 Nasser, Mrs. Nicholas (Adele Achem) Karolina
fuzz_match_0
0 41.538462
1 33.750000
2 60.000000
3 33.750000
4 42.750000
5 30.000000
6 27.692308
7 45.000000
8 45.600000
9 42.750000
df2
Name fuzz_string_0 \
0 Braund, Mr. Owen Harris Karolina
1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) Karolina
2 Heikkinen, Miss. Laina Karolina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Karolina
4 Allen, Mr. William Henry Karolina
5 Moran, Mr. James Karolina
6 McCarthy, Mr. Timothy J Karolina
7 Palsson, Master. Gosta Leonard Karolina
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) Karolina
9 Nasser, Mrs. Nicholas (Adele Achem) Karolina
fuzz_match_0
0 32.258065
1 17.241379
2 33.333333
3 15.686275
4 31.250000
5 25.000000
6 19.354839
7 31.578947
8 21.428571
9 23.809524
df3
Name fuzz_string_0 \
0 Braund, Mr. Owen Harris Karolina
1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) Karolina
2 Heikkinen, Miss. Laina Karolina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Karolina
4 Allen, Mr. William Henry Karolina
5 Moran, Mr. James Karolina
6 McCarthy, Mr. Timothy J Karolina
7 Palsson, Master. Gosta Leonard Karolina
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) Karolina
9 Nasser, Mrs. Nicholas (Adele Achem) Karolina
fuzz_match_0
0 46.153846
1 37.500000
2 66.666667
3 37.500000
4 46.153846
5 33.333333
6 30.769231
7 50.000000
8 50.000000
9 40.000000
Parameters:
df: [pd.Series]
word_to_search: str
partial_full_weighted: str
weighted = fuzz.WRatio
full = fuzz.ratio
partial = fuzz.partial_ratio
(default="weighted")
Returns:
pd.DataFrame
from a_pandas_ex_fuzz import pd_add_fuzzy_matching
pd_add_fuzzy_matching()
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")
df.ds_fuzz_compare_row_to_others(2,loc_or_iloc='iloc', partial_full_weighted='full', sort_values=True)
Out[4]:
PassengerId Survived Pclass ... Cabin Embarked aa_fuzz_match
2 3 1 3 ... NaN S 100.000000
216 217 1 3 ... NaN S 90.816327
816 817 0 3 ... NaN S 88.118812
382 383 0 3 ... NaN S 83.769634
400 401 1 3 ... NaN S 83.769634
.. ... ... ... ... ... ... ...
745 746 0 1 ... B22 S 54.450262
556 557 1 1 ... A16 C 53.744493
581 582 1 1 ... C68 C 53.456221
669 670 1 1 ... C126 S 52.132701
307 308 1 1 ... C65 C 51.612903
[891 rows x 13 columns]